Abstract

Evaluation  of  Multithreading  and  Caching 
in  Large  Shared  Memory  Parallel  Computers 

by 

Robert  Francis  Boothe 
Doctor  of  Philosophy  in  Computer  Science 
University  of  California  at  Berkeley 
Professor  Abhiram  G.  Ranade,  Chair 


Shared memory multiprocessors are considered among the easiest parallel computers
to program. However, building shared memory machines with thousands of processors
has  proven  difficult.  Two  main  problems  are  the  long  latencies  to  shared  memory  and  the 
large  network  bandwidth  required  to  support  the  shared  memory  programming  style. 

In  this  dissertation,  we  quantify  the  magnitude  of  these  problems  and  evaluate 
multithreading and caching as mechanisms for solving them. Multithreading works by
overlapping  communication  with  computation,  and  caching  works  by  filtering  out  a  large 
fraction  of  the  remote  accesses. 


We evaluate several multithreading models using simulations of eight benchmark
applications.  On  systems  with  multithreading  but  without  caching,  we  have  found  that  the 
best  results  are  obtained  for  the  explicit-switch  multithreading  model.  This  model  provides 
an  explicit  context  switch  instruction  that  allows  the  compiler  to  select  the  points  at  which 
context  switches  occur.  Our  results  suggest  that  a  200  cycle  memory  access  latency  can  be 
tolerated  using  multithreading  levels  of  10  threads  or  less  per  processor.  On  systems  with 
both  multithreading  and  caching,  we  have  found  that  the  switch-on-miss  multithreading  is 
best. For this model, our results suggest that a 200 cycle memory access latency can be
tolerated  using  multithreading  levels  of  3  threads  or  less  per  processor. 

We show that by using multithreading techniques, systems both with and without
Chapter 1

Introduction 


Shared memory multiprocessors are considered among the easiest parallel computers
to program [Boo89, HLRW92, Ken92, LM92, LLG+92, NL91, RT86, SHG92, TKB92].
Programming is easier because the shared memory programming model allows the programmer
to ignore issues such as the explicit location of data and its movement between
processors.  This  model,  however,  is  just  an  abstraction,  and  its  success  depends  on  the 
ability  of  the  computer  hardware  and  software  to  efficiently  support  it.  This  is  analogous 
to  the  abstraction  of  a  large  virtual  memory. 

For small machines, with from 4 to 30 processors, this shared memory abstraction
has been relatively easy to provide. It involves snooping caches on a single memory bus
connecting all of the processors. This configuration has been named a multi [Bel85] and
has been widely adopted for building small multiprocessors for which a single bus is able
to provide sufficient bandwidth. Examples include the Sequent Symmetry [LT88], Encore
Multimax [Enc87], and Silicon Graphics 4D-MP [BJS88].

However,  building  large  shared  memory  machines  has  proven  to  be  much  more 
difficult than building other types of large parallel machines. For example, there exist 1,024
processor (message passing) Ncubes, 1,024 processor (message passing) CM-5s, 16,384 processor
(SIMD) MasPars, and 65,536 processor (SIMD) CM-2s [DM93]. Large commercial
shared memory multiprocessors such as the KSR1 [Ken92] or the Cray-T3D [KS93] have only
recently  been  introduced  and  have  not  yet  been  built  in  very  large  configurations.  The  goal 
of  this  dissertation  is  to  understand  and  address  the  key  difficulties  impeding  the  design 
and  development  of  large  shared  memory  multiprocessors. 
1.1 The Latency Problem

By  large  shared  memory  machines,  we  mean  hundreds  or  thousands  of  processors. 
For  machines  of  this  size,  the  bandwidth  of  a  single  bus  is  inadequate,  and  thus  more 
complex  networks,  such  as  butterfly  or  grid  networks,  are  required.  The  latency  problem 
arises  when  a  processor  accesses  a  shared  memory  variable  that  is  located  in  a  memory 
module  across  the  network.  To  perform  this  remote  memory  access,  the  processor  issues  a 
request  message  into  the  network.  This  message  then  traverses  the  network  to  the  memory 
module. The memory module reads the value and then sends a result message back to
the requesting processor. The interval from the sending of the request message until the
return of the result message is called the remote memory access latency, or just the latency.

The  networks  of  large  machines  are  multi-hop  networks,  and  messages  are  subject 
to  switching,  transmission,  and  congestion  delays  at  each  stage  of  the  network.  In  a  butterfly 
network, for example, a message traverses O(log p) nodes to reach its destination, and in
a two dimensional grid network a message traverses O(√p) nodes, where p is the number
of processors. The aggregate latency
through  these  networks  can  be  hundreds  of  cycles.  The  latency  becomes  a  problem  if  the 
processor  spends  a  large  fraction  of  its  time  sitting  idle  waiting  for  remote  accesses  to 
complete. 
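
As a rough illustration of these scaling rules, the following Python sketch estimates how many network nodes a message crosses in each topology. The function names, the radix-2 butterfly, and the square grid without wraparound links (measured corner to corner, the worst case) are assumptions made for this example; they are not taken from any of the machines discussed in this chapter.

```python
import math

def butterfly_hops(p: int) -> int:
    """Switch stages crossed in a radix-2 butterfly with p endpoints: ceil(log2 p)."""
    if p < 2:
        return 0
    return (p - 1).bit_length()

def grid_hops(p: int) -> int:
    """Worst-case hops (corner to corner) in a square grid of p nodes, no wraparound."""
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square for a square grid")
    return 2 * (side - 1)

# Hop counts grow slowly for the butterfly and much faster for the grid.
for p in (64, 1024, 4096):
    print(p, butterfly_hops(p), grid_hops(p))
```

Under these assumptions a 1024 processor butterfly has 10 stages, while the grid needs up to 62 hops. The per-hop switching, transmission, and congestion delays multiply these counts into the latencies discussed next.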

Figure  1.1  shows  the  extrapolated  round  trip  network  latencies  (expressed  in  terms 
of  the  processor’s  cycle  time)  for  several  recent  or  proposed  large  parallel  machines.  These 
machines have a variety of architectures. The CM-5 [LAD+92] is a message passing machine
with a fat-tree [Lei85] network. DASH [LLJ+92] is a cache-coherent shared-memory
multiprocessor with a two-dimensional toroidal mesh network. The KSR1 [Ken92] is also a
cache-coherent shared-memory multiprocessor but with a ring (or hierarchy of rings) network.
TERA [ACC+90] (expected in 1994) is a shared-memory multiprocessor without
caching, and it uses a three-dimensional toroidal mesh network.

The  latencies  shown  in  the  graph  have  been  extrapolated  based  on  scaling  these 
networks. For the CM-5 and DASH, the latencies do not include congestion effects, and
thus  the  actual  latencies  in  heavily  loaded  networks  will  be  higher  than  these  curves.  The 
KSR1  has  not  yet  been  disclosed  well  enough  to  allow  extrapolating  a  complete  latency 
curve. 

For  machines  supporting  1024  processors,  these  graphs  suggest  that  we  can  expect 
latencies of 200 cycles or more, once congestion effects are taken into account. Furthermore,
Figure 1.2: Mechanisms for reducing the impact of memory latency.


different  thread  while  waiting  for  a  remote  reference  to  complete.  Weak  consistency[AH90] 
allows  some  overlapping  of  writes  without  concern  that  they  might  arrive  out  of  order. 
Prefetching[LYL87]  allows  requesting  data  before  it  will  be  needed.  By  layout[Hig93]  we 
mean  the  idea  of  arranging  data  on  or  near  the  processor  that  is  going  to  use  it.  And 
aggregation [HLRW92]  is  the  idea  of  getting  large  amounts  of  data  at  once. 

The mechanisms near the top of the diagram are more commonly automatic (or
invisible) as far as the programmer is concerned and are generally implemented in hardware.
The mechanisms near the bottom of the diagram are often implemented in software
either by a smart compiler or manually by the programmer. For example, in a message
passing program, the programmer explicitly specifies the layout of data and the packaging
of messages (i.e., aggregation of data).

All  of  these  mechanisms  have  their  limitations.  Caches  must  be  kept  coherent, 
which  becomes  complex  for  large  machines[HLRW92,  TD91].  Furthermore,  the  hit  rates 
may be low for accesses to shared data [DRPS87, GHG+91]. Multithreading requires complex
hardware to allow rapidly switching between the threads on a processor. And since it
requires extra threads, it also requires extra parallelism and is thus limited to larger problems.
Consistency models prohibit many compiler optimizations. Weak consistency allows
more  than  sequential  consistency,  but  it  is  a  less  intuitive  programming  model[GLL+90]. 
Prefetching  is  useful  for  applications  with  predictable  behavior  such  as  many  scientific  codes, 
but  it  is  of  limited  applicability  for  more  chaotic  codes  that  use  complex  data  structures.  It 
Figure 1.3: Evolution of multithreading models


also  may  waste  bandwidth  by  prefetching  data  that  is  not  used.  Generation  of  good  layouts 
is limited to the more regular and predictable applications. Finally, aggregation requires
coarse grain parallelism where large data items can be manipulated.

In this dissertation we have chosen to focus on evaluating the mechanisms of
multithreading and caching. A weakly consistent memory model is assumed throughout.
These are the more automatic mechanisms and are most consistent with the shared memory
programming model. The mechanisms of prefetching and layout can be of additional benefit,
and we have incorporated them in a limited fashion in a few of our studies; however, there
remains room for further research in these areas.


1.3  Overview  of  Previous  Multithreading  Work 

Previous multithreading research has been motivated by three concerns: tolerating
memory latency, building a fast pipeline, and supporting a dynamic dataflow-like execution
model. Figure 1.3 shows the evolution of multithreading models and some of the motivations
for moving from one model to another. Some of these [...]
previously but can be predicted based on the motivations.

The costs and concerns of multithreading are: the large number of threads needed,
the scheduling mechanism used to schedule the many threads on a processor, the cycles lost
to context switching overhead, and the large register file. These costs and concerns are
influenced by when and how often context switching is performed.


1.3.1 Fast Pipeline

The oldest model, switch-every-cycle, was used in the Denelcor HEP [Kow85]
and in MASA [HF88]. After each instruction, the processor switches to a different thread.
This allows a fast pipeline to be built because it eliminates data dependencies between
instructions in the pipeline by interleaving different threads. It also allows memory latencies
to be tolerated by not scheduling a thread until its reference has completed. Unfortunately
this model requires a large number of threads and a large amount of hardware to support
them. Also, by interleaving the instructions from many threads, a single thread is limited
to a small fraction of the processing power. TERA [ACC+90] is similar to the HEP, but the
scheduling policy has been changed so that a thread can issue more than one reference into
the network before waiting for the results.


1.3.2  Hiding  Memory  Latency 

The rest of the multithreading models that we consider execute a thread for many
cycles before context switching. The optimizing compiler is responsible for the ordering of
instructions so as to hide the small pipeline delays, and context switches are thus used only
to hide the long memory latency of remote accesses.

The switch-on-load model switches at load instructions which access shared
memory. Local memory loads and other instructions complete quickly and can be
scheduled by the compiler. Shared memory stores do not wait for their completion and
therefore do not cause context switches either. The advantages of this model over
switch-every-cycle are that a single thread can run at full speed between switches, and
fewer total threads will be needed since multithreading is not being used to hide pipeline
delays. Simpler hardware may also be possible since context switches are less frequent.

The switch-on-load model sometimes context switches sooner than it needs to
of the uses) to wait for all of the loads in the group [...] simple computation may
load two values from shared memory and [...]

An alternative method for grouping shared loads together is to add an explicit context
switch instruction between the group of loads and their subsequent uses. The explicit-switch
model allows similar grouping to be used, but it is simpler to implement and
requires the addition of only a single instruction. We evaluate the explicit-switch model
in Chapter 4 and find that it can eliminate from 50% to 80% of the context switches needed
by the switch-on-load model.

The most recent dataflow research [CSS+91, NPA92] has [...] explicit-switch
model. Short threads execute until their completion, at which point there is a
context switch to a new thread.


1.3.3  Adding  Caches 

[...] most applications just 2 or 3 threads per processor is sufficient.

The switch-on-miss model switches at points where load instructions miss in the
cache. An early study of this by Weber & Gupta [WG89] suggested substantial performance
benefits were available, but a later study as part of the DASH project [GHG+91] had
less optimistic results. Switch-on-miss multithreading was also studied as part of the
ALEWIFE project [ALKK90] and achieved good results for a few simple applications. One
drawback of context switching on cache misses is that the miss is detected after a
number of subsequent instructions have started down the CPU pipeline. These instructions
must be canceled, and thus there will be a context switch cost of several cycles because of
the wasted pipeline slots.

The  switch-on-use-miss  model  context  switches  when  a  use  instruction  tries  to 
use  the  value  from  a  shared  load  that  missed  in  the  cache.  It  was  studied  (approximately) 
by  the  DASH  project[GHG+91]  when  they  looked  at  the  combination  of  prefetching  and 
multithreading.  Their  prefetch  instructions  act  like  the  initial  load  instructions,  and  their 
subsequent  load  instructions  act  like  the  use  of  the  data.  They  found  little  benefit  from 
prefetching when combined with multithreading; however, they state that their prefetching
method  was  meant  for  a  single  threaded  processor  and  should  be  done  differently  for  a 
multithreaded  processor. 

The  conditional-switch  model  adds  caching  to  the  explicit-switch  model.  The 
code  appears  the  same  as  that  for  the  explicit-switch  model:  there  is  a  group  of  load 
instructions,  followed  by  a  context  switch  instruction,  followed  by  the  instructions  that 
use  the  loaded  data.  The  difference  is  that  the  context  switch  instruction  is  treated  as  a 
conditional  switch  instruction.  If  any  of  the  loads  preceding  the  switch  instruction  missed 
in  the  cache,  a  context  switch  is  performed  as  expected.  But  if  all  of  the  preceding  loads  hit, 
the  context  switch  instruction  is  ignored  and  the  thread  continues  executing.  This  model 
provides the benefits of grouping and caching as in the switch-on-use-miss model, but it
may  be  simpler  to  implement. 
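
To make the differences between these models concrete, the Python sketch below counts the context switches that a single thread would take under four of them. The trace format (groups of shared loads, each marked as a cache hit or miss), the function name, and the sample trace are hypothetical and chosen only for illustration; this is not the simulator used in this research.

```python
from typing import List

# Each group holds one boolean per shared load: True means the load would hit
# in the cache (when the model has a cache), False means it would miss.
Group = List[bool]

def count_switches(groups: List[Group], model: str) -> int:
    """Count the context switches one thread takes under the given model."""
    switches = 0
    for loads in groups:
        if not loads:
            continue
        if model == "switch-on-load":        # no cache: every shared load switches
            switches += len(loads)
        elif model == "explicit-switch":     # no cache: one switch per load group
            switches += 1
        elif model == "switch-on-miss":      # cache: every missing load switches
            switches += sum(1 for hit in loads if not hit)
        elif model == "conditional-switch":  # cache: one switch if any load missed
            switches += 0 if all(loads) else 1
        else:
            raise ValueError(f"unknown model: {model}")
    return switches

trace = [[False, False], [True, True], [True, False, False]]
for m in ("switch-on-load", "explicit-switch", "switch-on-miss", "conditional-switch"):
    print(m, count_switches(trace, m))
```

For this sample trace the counts are 7, 3, 4, and 2 respectively. The example shows only how grouping and caching each remove switches; it says nothing about the cost of a switch or about how often real loads hit.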


1.4  Limited  Bandwidth 

Besides having long latencies on remote accesses, the networks on large parallel
machines are also likely to have limited bisection bandwidths². Figure 1.4 shows the
bisection bandwidths of various parallel computers. These are peak bandwidths and are not
expected to be fully achieved under real traffic patterns. Also, the bandwidths have been
extrapolated to 1024 processors based on the proposed designs, even if such a large machine
has not actually been built. The bandwidth values were calculated based on descriptions
of the networks in [ML92], [Ken92], [LLJ+92], [LAD+92], and [ACC+90].

² Bisection bandwidth is defined as the minimum bandwidth capacity between the two halves of a bisected
machine, considering all possible bisections.

Figure 1.4: Bisection bandwidth of various parallel computers as a function
of machine size. The bandwidth is expressed in terms of bits per processor
operation (bits / op) at the peak capacity and peak execution rate.

The  key  point  of  this  graph  is  that  for  most  networks,  the  bisection  bandwidth 
drops as the number of processors is increased, and that for a large (1024 processor) machine,
only  1  or  2  bits  of  bandwidth  per  operation  will  be  available.  In  fact,  achievable  bandwidth 
may  be  only  half  of  that  amount  because  of  the  congestion  caused  by  irregular  traffic 
patterns[Dal90]. 

The  reason  the  bandwidth  drops  off  as  the  number  of  processors  increases  is  related 
to  the  scaling  characteristics  of  the  networks.  For  the  Sequent [ML92],  there  is  a  single 
shared  bus  and  thus  the  bandwidth  per  processor  diminishes  in  proportion  to  the  number 
of  processors.  For  both  bandwidth  and  electrical  reasons,  sharing  a  single  bus  limits  the 
number  of  processors  to  around  30. 

The KSR1 [Ken92] is similar: it uses a single high bandwidth ring for small machines,
or a two level ring of rings for larger machines. The bisection bandwidth depends
only on the top level ring, and when calculated on a per processor basis, decreases linearly.
For large machines they stave off the bandwidth decline by providing multiple rings at the
top level. The three lines in Figure 1.4 for the KSR1 represent the three configuration
options for these top level rings. The largest option has 4 GB/sec of bandwidth along the
ring (8 GB/sec crossing the bisection), but when divided among 1024 processors (two-way
superscalar) running at 20 MHz, this provides only 1.6 bits per operation.
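
The bits-per-operation figure for the largest KSR1 configuration can be checked with a short calculation. The Python sketch below is only a worked version of that arithmetic; the function name is ours, and it assumes that 1 GB means 10^9 bytes and that a two-way superscalar processor peaks at two operations per cycle.

```python
def bits_per_operation(bisection_bytes_per_sec: float,
                       processors: int,
                       ops_per_cycle: int,
                       clock_hz: float) -> float:
    """Peak bisection bandwidth divided by the peak aggregate operation rate."""
    bits_per_sec = bisection_bytes_per_sec * 8
    ops_per_sec = processors * ops_per_cycle * clock_hz
    return bits_per_sec / ops_per_sec

# 8 GB/sec across the bisection, 1024 two-way superscalar processors at 20 MHz.
ksr1 = bits_per_operation(8e9, 1024, 2, 20e6)
print(round(ksr1, 1))  # 1.6
```

The unrounded result is 1.5625 bits per operation, which rounds to the 1.6 quoted in the text.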

The DASH [LLJ+92] architecture scales better because it is based on a 2-D wraparound
mesh (torus) rather than a ring. The bisection bandwidth per processor drops off in
inverse proportion to the square root of the number of processors. At 1024 processors, which
is more than this design was meant for, the bisection bandwidth is 1.8 bits per operation.

The CM-5 network [LAD+92] is a fat tree. Fat trees [Lei85] are a family of networks
where the connections between nodes at higher levels of the tree are generally “fatter” than
the connections between nodes at lower levels of the tree. With the appropriate connection
widths, a fat tree can provide constant bisection bandwidth per processor as the machine
is scaled. However, to save costs, the designers chose to eliminate many of the channels at
the higher levels in the tree. For 1024 processors, the bisection bandwidth is 2.5 bits per
operation. This figure is without the optional vector units. If they are included and achieve
their  potential  factor  of  4  performance  increase,  the  bandwidth  when  expressed  in  bits  per 
operation  will  be  reduced  to  only  0.6  bits. 

Finally, and in contrast to the other networks, the proposed Tera network provides
a bisection bandwidth of 55 bits per operation and scales the bandwidth linearly
with the number of processors. The network is a sparsely populated 3-D wrap-around
mesh [ACC+90], and to scale the bandwidth linearly, they increase the number of network
nodes faster than the number of processors. For large machines Tera has more than an order
of magnitude greater bandwidth than other machines. We suspect, however, that providing
this large network bandwidth may not prove cost effective.

In  this  thesis  we  do  not  focus  on  any  particular  network  topology.  Instead  we 
measure  the  bandwidth  needs  of  our  benchmark  applications,  and  then  use  the  results  to 
reason  about  the  types  of  machines  that  should  be  built  and  the  bandwidth  capacity  that 
they  should  supply. 

1.5  Overview  of  Thesis 

In  this  dissertation  we  concentrate  on  the  switch-on-load,  explicit-switch, 
switch-on-miss,  and  conditional-switch  models.  If  caches  are  not  used,  our  results 
will show that grouping is important and thus explicit-switch is preferable to switch-on-load.
However, if caches are used, our results will show that grouping has little benefit and
thus  switch-on-miss  is  preferable  to  conditional-switch. 

The  remainder  of  this  dissertation  is  organized  as  follows:  Chapter  2  discusses 
our  simulation  methodology  and  our  set  of  benchmark  applications.  Chapter  3  presents 
a  performance  model  for  a  multithreaded  processor.  Chapter  4  focuses  on  hiding  latency 
with multithreading and evaluates the switch-on-load and explicit-switch multithreading
models. Chapter 5 adds coherent caching to the system and evaluates the switch-on-miss
and conditional-switch multithreading models. Chapter 6 considers the problem of
limited  network  bandwidth  and  presents  results  on  the  amount  of  bandwidth  that  is  needed 
by  the  various  applications  and  multithreading  models.  Chapter  7  presents  miscellaneous 
studies  and  experiments  on  synchronization  and  various  caching  issues.  Chapter  8  discusses 
the  hardware  mechanisms  needed  to  support  a  multithreaded  processor.  And  Chapter  9 
presents  conclusions  and  directions  for  future  research. 

There are also two appendices: Appendix A explains a new method for plotting
distributions that we have introduced in order to visually present both clearly and compactly
the types of distributions that we have encountered. And Appendix B explains the
techniques  used  to  build  the  simulator  that  made  this  research  possible. 
Chapter 2

Methodology 


This chapter discusses our simulation based research methodology. [...]
model of a large shared memory parallel machine. Then we present [...]

[...] complex [...] Saavedra-Barrera et al. [SBCvE90, SBC91] and Agarwal [Aga92]
have [...] are independent and have access [...] exponentially [...] (according to a
Poisson process). The threads of [...] they have substantial sharing of data and [...]
synchronization to coordinate this sharing. We will also see in Section 3.2 that [...]
rates, amount of sharing, synchronization patterns, imperfect load balancing,
[...] overhead, and time varying behavior, that are [...] characterize and are not yet
well understood.


2.1 Machine Model

[...] of a number of processors [...] shared [...] interconnected
by a switching network. Each processor also has a local memory which holds


Figure 2.1: Model of large shared memory multiprocessor. (Diagram labels: local memory, processors with optional caches, shared memory.)


variables  local  to  threads,  stacks,  and  the  code.  We  assume  all  accesses  to  local  memory  are 
instantaneous.  This  is  reasonable  since  local  data  and  instructions  can  be  easily  cached,  and 
any  misses  can  be  serviced  locally.  Accesses  to  shared  memory  are  sent  across  the  network 
and  thus  have  a  long  latency  before  they  return. 

We  look  at  two  variations  of  this  model:  one  with  caching  of  shared  data  (as 
shown  in  the  figure),  and  the  other  without  it.  Both  types  of  systems  have  been  and 
are  being  built.  Examples  of  systems  without  coherent  caching  are  the  HEP[Kow85], 
BBN Butterfly [BBN89], and Tera [ACC+90]. Systems with coherent caching include all
of the shared bus based systems such as the Sequent [Ost89] as well as the more scalable
KSR1 [Ken92] and research projects such as DASH [LLG+90, LLG+92, LLJ+92] and
ALEWIFE [ALKK90].

For  a  real  machine,  the  model  in  Figure  2.1  will  likely  be  folded  back  upon  itself  so 
that  each  processor  is  directly  connected  to  one  of  the  shared  memory  modules.  This  would 
give  each  processor  direct  access  to  a  small  portion  of  shared  memory.  If  the  programmer 
(or  compiler)  can  control  the  layout  of  data  onto  the  memory  modules,  she  might  be  able 
to  arrange  the  data  on  or  near  the  processors  where  it  will  be  used.  Such  layouts  could 
eliminate  a  large  fraction  of  the  remote  references,  but  they  are  not  possible  for  many 
applications [SHG92].

In this research (with our applications, programming languages, and compiler technology)
we do not have the capability of customizing the data layout for each application.
We therefore assume that data is randomly interleaved across the memory modules. This
interleaving is done at the level of the largest memory access unit. For systems without
caches, the largest memory access unit is a double word; for systems with caches, the
memory access unit is a cache line.
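
One way to picture this interleaving is as a pseudo-random mapping from each memory access unit to a home module. The Python sketch below is a minimal illustration under our own assumptions: the hash-based mapping, the function name, the 8 byte double word, and the 32 byte cache line are hypothetical choices for the example, not a description of how our simulator or any real machine assigns addresses.

```python
import hashlib

DOUBLE_WORD = 8   # bytes; assumed access unit for systems without caches
CACHE_LINE = 32   # bytes; hypothetical line size for systems with caches

def home_module(address: int, unit_bytes: int, num_modules: int) -> int:
    """Map a byte address to a memory module, interleaving at unit granularity."""
    unit_index = address // unit_bytes  # all bytes of one unit share a home module
    digest = hashlib.sha256(unit_index.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % num_modules

# Both halves of a double word land on the same module; neighbouring double
# words are scattered pseudo-randomly over the 1000 modules.
print(home_module(0x1000, DOUBLE_WORD, 1000), home_module(0x1004, DOUBLE_WORD, 1000))
print(home_module(0x1008, DOUBLE_WORD, 1000))
```

Hashing the unit index rather than the byte address keeps each double word or cache line whole, so that it can be fetched with a single remote reference.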

Without optimization of the data layout, the performance advantage of the folded
machine configuration is small. On a 1000 processor machine, for example, only 1/1000th
of the accesses will be to the locally accessible memory module. This small factor is
insignificant [...] we study the model [...]

Mellor-Crummey and Scott [MCS91] argue against building such “dance hall”¹ machines
[...] efficient [...] techniques depend upon having either coherent
caching or local access to part of shared memory. In Section 7.1 we propose preferable
synchronization techniques that will eliminate this taboo on “dance hall” machines.

2.1.1  Network 

There are many proposed network topologies². Figure 2.2 shows some that are
popular for reasons involving latency, bandwidth, cost, modularity, and availability of simple
routing [...]. [...] active research area with many competing
concerns. In this research we do not select any specific network, but instead we focus on the
[...] of these networks. First, they are [...] packet [...] networks
[...] references to [...] traveling the network simultaneously.
Second, [...] processor have several outstanding references
[...]. And third, references will [...] because they
are routed through many network nodes and experience congestion and delays along the [...]

In this research we are interested in machines which range in size from a hundred
[...] processors. [...] we [...] a 1000 processor [...] parameters and further reasoning.
Figure 1.1 in Chapter 1 showed the latencies of several existing and proposed large
interconnection networks. For a 1000 processor machine, these networks all have round trip
latencies in the range of a few hundred processor cycles. We choose a latency of 200 cycles
as representative of these figures and focus this research on tolerating latencies of this
magnitude.

¹ [...] the processors and memories separated at different ends of the network [...] separated boys and girls.

² See for example: Almasi/Gottlieb [AG89] chapter 8.

We  model  the  network  and  shared  memories  simply  as  a  black  box  that  takes  200 
cycles  to  respond  to  a  remote  memory  reference.  In  a  real  network,  much  of  this  latency  will 
come  from  delays  due  to  minor  random  congestion,  and  these  small  delays  are  an  expected 
component of the 200 cycle latency. A more difficult case is severe congestion caused, for
instance, by a program induced hot spot in which every processor tries to simultaneously
access the same memory location. In this case delays can become much longer than the delay
of 200 cycles that we have assumed. We ignore such congestion initially, but in Section 6.2
we assess the frequency of such hot spots and their impact on our simulation results.
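
The black box view of the network and shared memory can be expressed in a few lines. The Python sketch below is an illustrative stand-in written for this explanation, not code from our simulator; the class name, its methods, and the use of thread identifiers are assumptions. Each remote reference simply completes a fixed number of cycles after it is issued, with no modeling of congestion or hot spots.

```python
import heapq
from typing import List, Tuple

class BlackBoxMemory:
    """Network plus shared memory modeled as a fixed-latency black box."""

    def __init__(self, latency_cycles: int = 200) -> None:
        self.latency = latency_cycles
        # Min-heap of (ready_cycle, sequence_number, thread_id).
        self._pending: List[Tuple[int, int, int]] = []
        self._seq = 0

    def issue(self, now: int, thread_id: int) -> int:
        """Record a remote reference issued at cycle `now`; return its completion cycle."""
        ready = now + self.latency
        heapq.heappush(self._pending, (ready, self._seq, thread_id))
        self._seq += 1
        return ready

    def completed(self, now: int) -> List[int]:
        """Remove and return the threads whose references have completed by `now`."""
        done = []
        while self._pending and self._pending[0][0] <= now:
            done.append(heapq.heappop(self._pending)[2])
        return done

mem = BlackBoxMemory()
mem.issue(0, thread_id=1)
mem.issue(10, thread_id=2)
print(mem.completed(199), mem.completed(200), mem.completed(210))
```

A processor model would call `issue` when a thread makes a remote reference and poll `completed` each cycle to learn which threads are ready to run again.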

2.1.2  Processor 

We  expect  that  the  processors  used  in  parallel  systems  will  be  the  same  or  very 
similar  to  the  microprocessors  used  in  high  performance  workstations.  This  is  because  the 
peak  performance  of  a  parallel  system  is  the  product  of  the  performance  of  a  single  processor 
and the number of processors. Such a large development effort is put into the race for the
highest performance microprocessor that these push the technology curve and offer the most
cost effective single processor.

To  tolerate  latency,  however,  we  evaluate  multithreading  techniques  which  require 
that the processor be able to context switch rapidly between threads. In the past, multithreaded
processors, such as the HEP [Kow85], have involved very different and complex
processor designs that context switch every cycle and use a large number of threads. Instead
we look at multithreaded processors that are similar to today’s RISC microprocessors
with the addition of being able to context switch on long latency remote memory accesses.
Chapter 8 discusses the hardware issues in detail. Here we wish simply to specify our assumptions
about the multithreaded processors that we simulate, and leave their justification
to  Chapter  8. 

We assume the same instruction set and instruction timings as the MIPS
R3000[Kan89], but with a few modifications. Most importantly, we assume that the register
file has been replicated on the chip enough times so that each thread running on the chip
can have its own set of registers. Because the registers are on chip and do not have to be
saved or loaded from memory on a context switch, the processor should be able to switch
quickly between threads; in some cases as fast as a single cycle (see Section 8.1.1).

Another modification is that we have added double word loads and stores to the
instruction set. Many floating point numbers are stored as double words, and it is crucial
(when the network has long latencies) to fetch the whole value at once rather than issuing
two separate references as is done on the MIPS R3000. More recent machines, such as the
MIPS R4000, all provide double word loads and stores.

Finally, we provide both local and shared versions of all memory access instructions.
This is based on the assumption that memory references can be classified by the
compiler as either local or shared. For instance, references to locations in a shared array would
use shared-load instructions while references to local variables would use local-load
instructions. This compiler classification may not be possible in the case of pointers if it is
unclear what is pointed to and whether or not it resides in shared or local memory. We call
these unclear cases ambiguous pointers, and they must be resolved at run-time either with
extra code or special hardware, which will likely slow down and/or complicate the machine.
Ideally we would like the compiler to classify as many references as possible because this
information will be needed for compiler optimizations in Chapter 4.
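
As an illustration of what resolving an ambiguous pointer "with extra code" might look like, the following minimal sketch tests an address against a shared memory range at run-time. It assumes that shared memory occupies a single contiguous address range; the bounds and the function names are hypothetical and are not taken from the simulated machine.

```python
# Sketch of a run-time check for ambiguous pointers.
# Assumption: shared memory is one contiguous address range. The bounds
# below are made-up values used only for illustration.
SHARED_BASE = 0x8000_0000
SHARED_LIMIT = 0xC000_0000


def is_shared(address):
    """Return True if the address falls inside the assumed shared range."""
    return SHARED_BASE <= address < SHARED_LIMIT


def select_load(address):
    """Pick the load flavor that a dereference of this address would need."""
    return "shared-load" if is_shared(address) else "local-load"
```

A compiler that cannot classify a pointer statically would have to emit a test of this kind before every dereference, which is the run-time cost that static classification avoids.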

2.1.3  Programming  Language 

Our  applications  are  written  in  the  augmented  C  dialect  that  is  used  in  writing 
shared  memory  programs  on  the  Sequent[Ost89].  Figure  2.3  shows  an  example  of  a  simple 
program  that  multiplies  two  matrices.  The  arrays  d,  e,  and  f  are  declared  as  residing  in 
shared  memory  by  the  addition  of  the  type  modifier  “shared”  before  their  declaration. 

Unfortunately  this  language  does  not  have  shared  memory  declarations  for  objects 
accessed  indirectly  via  pointers.  The  compiler  does  not  know  that  the  parameters  a,  b,  and 
c  to  the  worker  function  will  be  arrays  in  shared  memory.  For  this  simple  program  the 
compiler  might  deduce  this  information  through  global  analysis,  but  in  the  general  case  this 
is  difficult. 

In our simulations we at first used dynamic testing of pointers to determine if
addresses were in local or shared memory. Later we observed that for our application
programs, true ambiguous cases (where sometimes a pointer points to a local location but at
other times points to a shared location) never occurred. We thus collected classification
information from an initial run of the application and fed it back into subsequent
compilations, as is done in trace analysis. This allowed complete compiler classification of all

    /* multiply two matrices */

    shared double d[10][10], e[10][10], f[10][10];

    main()
    {                                 /* will compute f = d * e */
        /* initialize d and e */
        m_set_procs(100);             /* set to 100 threads */
        m_fork(worker, d, e, f);      /* fork the threads */
        /* print result: f */
    }

    worker(a, b, c)
    double a[10][10], b[10][10], c[10][10];   /* compute c = a * b */
    {
        int i, j, k, myid;
        double sum;

        myid = m_get_myid();          /* id 0 -> 99 */
        i = myid/10;                  /* calculate thread's row */
        j = myid%10;                  /* calculate thread's col */

        sum = 0.0;
        for (k = 0; k < 10; k++)
            sum += a[i][k] * b[k][j]; /* sum products */
        c[i][j] = sum;                /* each thread calculates */
                                      /* one element of c */
    }

Figure 2.3: Example of a shared memory program.
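
The mapping from thread id to matrix element used in Figure 2.3 can be sketched sequentially as follows. This is only an illustration of the index arithmetic: `worker` mirrors the figure, while `run_all_threads` is a hypothetical stand-in for the fork of 100 threads and simply calls the worker once per thread id.

```python
# Sequential sketch of the thread-to-element mapping in Figure 2.3.
# Assumption: run_all_threads is an illustrative replacement for the
# parallel fork; it is not part of the original program.

N = 10  # matrices are N x N, with one thread per element of the result


def worker(myid, a, b, c):
    """Compute the single element of c = a * b owned by thread myid."""
    i = myid // N          # thread's row
    j = myid % N           # thread's column
    total = 0.0
    for k in range(N):     # dot product of row i of a and column j of b
        total += a[i][k] * b[k][j]
    c[i][j] = total


def run_all_threads(a, b):
    """Emulate the N*N forked threads by running each worker in turn."""
    c = [[0.0] * N for _ in range(N)]
    for myid in range(N * N):
        worker(myid, a, b, c)
    return c
```

Because each thread writes a distinct element of c, the workers need no synchronization among themselves, which is why the example requires only the fork and the implicit join.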

Application   Lines   Cycles   Description & Problem size
-----------   -----   ------   --------------------------
sieve           236    106 M   counts primes
                               number of primes < 4,000,000
blkmat          369     87 M   blocked matrix multiply
                               200 x 200 matrices
sor             333    258 M   successive over relaxation
                               192 x 192 grid
ugray         10784   1353 M   ray tracing graphics renderer
                               gears (7169 faces), 20 x 512 slice of image
water          1368   1082 M   simulate a system of water molecules
                               343 molecules, 2 iterations
locus          6347    665 M   route wires in a standard cell circuit
                               Primary2 (1290 cells x 20 channels)
mp3d           1510    192 M   simulate rarefied hypersonic flow
                               100,000 particles, 10 iterations
barnes         2109   1148 M   gravitational N-body simulation
                               4096 bodies in two clusters

Table 2.1: Parallel Applications


references. 
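
The feedback step described above can be sketched as follows. This is a minimal illustration under the assumption that the initial run produces a trace of (reference site, observed class) pairs; the trace format and the function name are hypothetical, not the format used by our tools.

```python
from collections import defaultdict


def classify_sites(trace):
    """Classify each reference site from an initial profiling run.

    trace: iterable of (site, kind) pairs, with kind "local" or "shared".
    Returns {site: "local" | "shared" | "ambiguous"}; a site is ambiguous
    only if it was observed touching both kinds of memory.
    """
    seen = defaultdict(set)
    for site, kind in trace:
        seen[site].add(kind)
    return {
        site: (next(iter(kinds)) if len(kinds) == 1 else "ambiguous")
        for site, kinds in seen.items()
    }
```

For our applications no site would be reported as ambiguous, so every reference could be compiled to either a local or a shared access. Such a classification is only as good as the coverage of the profiling run.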

Although  we  were  able  to  obtain  complete  classification  information  through  trace 
analysis,  we  would  like  to  advocate  that  this  shared  versus  local  distinction  is  important  and 
that  it  should  be  supported  explicitly  by  future  shared  memory  parallel  languages.  This 
could  be  done  by  allowing  declarations  of  parameters  (such  as  a,  b,  and  c  in  the  example) 
and  pointers  as  pointing  to  shared  memory. 


2.2  Benchmark  Applications 

Table  2.1  shows  the  eight  benchmark  applications  used  in  this  research.  These 
are  all  scientific  programs  that  perform  some  computation  or  numeric  simulation.  The  first 
three  (sieve,  blkmat,  and  sor)  are  toy  applications  written  as  part  of  this  research.  The 
other five are real applications. Ugray[Boo89] was parallelized by the author, and has been used
in a few parallelism studies[BR92, LS91, O’K92]. The last four (water, mp3d, locus, and
barnes) are part of the Stanford SPLASH benchmark set[SWG92] and have been used in
many studies, especially those associated with the DASH project[LLJ+92].

Each  of  the  applications  has  some  unique  behavioral  characteristic(s),  and  the 
three  toy  applications  were  chosen  because  they  each  have  distinct  behaviors  that  broaden 
the scope of the total benchmark set. The following sections contain brief descriptions of
each application and report [...] physical processor.

Figure 2.4: Sieve: parallel primes finder. (Legend: 0 = prime, * = not prime; the
lower part of the bit vector is labeled as the known region.)

2.2.1 Sieve

The sieve application finds and counts the number of primes that are less than
some given number. Figure 2.4 shows how this algorithm is partitioned for parallel execution.
It represents the number space by a bit vector in shared memory with one bit per
number [...]. Initially all bits are 0, which means that the number might be prime. As the
sieve executes, whenever a number is determined to not be prime, its bit is set to 1.

At any time there is a region of known primes. If this region goes up to $n$, then
it is adequate for computing all primes up to $n^2$. The region from $n$ to $n^2$ is called the
unknown region, and it is partitioned across the threads. Each thread uses the primes
in the known region to perform the sieve on its portion of the unknown region. A barrier
synchronization is then done, and the known region is expanded to $n^2$ and the unknown region
[...].

The main characteristics of sieve are:

• Regular intervals between accesses.

• No sharing of data in the unknown region.

• Very infrequent synchronization.
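
The partitioning scheme can be sketched as below. This is an illustrative sequential rendering, not the benchmark source: the function names, the initial known region of 10, and the use of Python lists in place of a shared bit vector are all assumptions. Each call to `sieve_chunk` corresponds to the work of one thread between two barriers.

```python
def small_primes(n):
    """Primes <= n by a plain sequential sieve (the initial known region)."""
    flags = bytearray(n + 1)            # 0 = might be prime, 1 = not prime
    primes = []
    for p in range(2, n + 1):
        if flags[p] == 0:
            primes.append(p)
            for m in range(p * p, n + 1, p):
                flags[m] = 1
    return primes


def sieve_chunk(lo, hi, known_primes):
    """Work of one thread: find primes in [lo, hi) using only known primes."""
    flags = bytearray(hi - lo)
    for p in known_primes:
        if p * p >= hi:                 # larger primes cannot mark anything
            break
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi, p):
            flags[m - lo] = 1
    return [lo + k for k, f in enumerate(flags) if f == 0]


def count_primes(limit, num_threads=4, seed=10):
    """Count primes < limit, growing the known region from n to n*n."""
    if limit <= 2:
        return 0
    n = min(seed, limit - 1)
    known = small_primes(n)
    while n < limit - 1:
        top = min(n * n, limit - 1)     # end of the unknown region
        step = -(-(top - n) // num_threads)      # ceiling division
        found = []
        for t in range(num_threads):    # each iteration is one thread's share
            lo = n + 1 + t * step
            hi = min(lo + step, top + 1)
            if lo < hi:
                found.extend(sieve_chunk(lo, hi, known))
        known.extend(found)             # barrier: known region grows to top
        n = top
    return len(known)
```

Since the known region squares in size at each step, only a handful of barriers are needed even for large limits, which is consistent with the very infrequent synchronization noted above.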

2.2.2  Blkmat 

The blkmat application multiplies matrices. Figure 2.5(a) shows the partitioning
[...] the computation assigned to thread 5. To perform its computations, thread 5 will read
values from a [...] region of matrix A and a [...] region of matrix B. Each thread makes local
copies of the values it needs from A and B and reuses these local copies rather than
repeatedly retrieving them from shared memory. [...]

Figure 2.6: Sor: successive over relaxation.


during  this  phase  are  much  higher  than  in  the  main  calculation  phase  of  the  algorithm. 
The  main  characteristics  of  blkmat  are: 

•  Low  access  rates  to  shared  memory. 

•  Varying  intervals  between  accesses. 

•  A  separate  termination  phase  with  much  higher  access  rates. 

2.2.3  Sor 


The sor application is an iterative solver of Laplace’s equation using the method
of successive over relaxation. We use it for computing the heat flow in a square metal
plate. The plate is represented by a grid of cells and is partitioned into regions as shown
in Figure 2.6. Interactions between threads occur along the edges between regions, and
thus the cells are partitioned into squarish regions in order to minimize the lengths of their
edges. The outside edges of the grid contain the fixed boundary conditions and are not part
of any thread’s partition.


The computation proceeds by taking a cell in the grid and replacing it with a new
value computed as a weighted sum of the old value and the four Manhattan neighboring
cells. The weights are chosen so as to make the computation converge as quickly as possible.
After every few iterations, convergence is checked by comparing the new values to saved
copies of previous values.

In order to avoid mixing results from the current and previous iterations, the grid
is split like a checkerboard into red and black cells. First all of the red cells are calculated
and updated, followed by a barrier synchronization. After the barrier all of
the black cells are calculated and updated, and then there is another barrier. Recalculation
of the red and black cells alternates in this fashion.

The  main  characteristics  of  sor  are: 

•  High  access  rates  to  shared  memory. 

•  Repeated  barrier  synchronization. 

•  Static  partitioning  and  reuse  of  shared  data. 
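
The red and black update phases can be sketched as below. This is a minimal sequential illustration, not the benchmark code: the relaxation weight `omega`, its default value, and the function names are assumptions, and the comments mark where the barriers would fall in the parallel version.

```python
def sor_sweep(grid, omega, color):
    """Update in place all interior cells of one color.

    color 0 selects the "red" cells where (i + j) is even and color 1 the
    "black" cells. Each new value is a weighted sum of the old value and
    the four Manhattan neighbors.
    """
    rows, cols = len(grid), len(grid[0])
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            if (i + j) % 2 == color:
                neighbors = (grid[i - 1][j] + grid[i + 1][j]
                             + grid[i][j - 1] + grid[i][j + 1])
                grid[i][j] = (1 - omega) * grid[i][j] + omega * neighbors / 4.0


def sor_iterate(grid, omega=1.5, iterations=200):
    """Alternate red and black phases for a fixed number of iterations."""
    for _ in range(iterations):
        sor_sweep(grid, omega, 0)   # red phase; a barrier would follow
        sor_sweep(grid, omega, 1)   # black phase; a barrier would follow
    return grid
```

Every neighbor of a red cell is black and vice versa, so within one phase no cell reads a value that is being written in that same phase. This is what makes the two barriers per iteration sufficient.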

2.2.4  Ugray 

The ugray application is a ray tracing graphics renderer. This is a computationally
intensive rendering algorithm for producing high quality images. The sequential algorithm is
discussed in [Mar87], and its parallelization is discussed in [Boo89].

[...] connected data structures [...] space. [...] The input scene (gears)
has 7169 faces [...].

The  main  characteristics  of  ugray  are: 

• Complex linked data structures (cannot be prefetched).

• Moderate access rates (complex calculations).

•  Dynamic  scheduling  of  jobs. 

•  Unpredictable  reuse  of  data. 


2.2.5  Water 


The water application simulates a system of water molecules in the liquid state.
A brief description of this application and its parallelization appears in the SPLASH
report[SWG92]. [...] with molecules beyond a certain distance.

The data set we used has 343 molecules (this is the largest data set available).
The molecules are statically subdivided among the threads, and the same molecules are
[...]. Because of the small (and odd) number of molecules, the load cannot be balanced
perfectly, and the most heavily loaded thread determines the rate of progress of the
computation. Thus for our simulations we tried to choose the number of threads so that the
imbalanced threads would have less work, rather than more work, than the others [...]. With
172 threads, for example, each thread is given 2 molecules, except for the last thread which
gets only 1. Thus the last thread will finish earlier than the other threads. If instead we
had used 171 threads, each thread would get 2 molecules except for one of them which would
get 3 molecules! [...] Nearly the entire machine would have to wait on it. [...] There are
also some large calculation phases, which result in long periods during which no remote
accesses occur.

The  main  characteristics  of  water  are: 

• Bursty traffic with long periods having no remote accesses.

• Good reuse of data.

• Imperfect static load balancing that is sensitive to the number of threads used.
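
The sensitivity to the number of threads can be checked with a little arithmetic. The sketch below assumes that every molecule costs the same amount of work and that molecules are divided as evenly as possible; the function names are illustrative only.

```python
def max_load(molecules, threads):
    """Molecules held by the most heavily loaded thread (ceiling division)."""
    return -(-molecules // threads)


def balance_efficiency(molecules, threads):
    """Average load divided by maximum load; 1.0 means perfect balance."""
    return (molecules / threads) / max_load(molecules, threads)
```

With 343 molecules, 172 threads give a maximum load of 2 and a balance of about 0.997, while 171 threads give a maximum load of 3 and a balance of about 0.67, even though only one thread was removed.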

2.2.6  Locus 

The locus application routes wires in a standard cell circuit. A brief description
of this application and its parallelization [...] wires and
channels. The wire counts of the channels are [...] two dimensional cost array,
and the bulk of the shared memory [...]
schedule and route wires. After the
initial routing, wires are ripped up and re-routed to further optimize the result.

We  used  the  largest  input  circuit  available  (Primary2.grin)  which  has  3817  wires 
and  a  1290  x  20  array  of  routing  channels.  This  input  shows  good  speedups  up  to  around 
64  threads,  but  performance  gains  diminish  past  this  point.  This  application  has  the  least 
parallelism  of  the  applications  we  have  used,  but  was  included  for  reasons  of  application 
diversity. 

The  main  characteristics  of  locus  are: 

•  High  access  rates. 

•  Linear  sequences  of  array  accesses. 

•  Dynamic  scheduling. 

•  Limited  parallelism. 

2.2.7  Mp3d 

The  mp3d  application  simulates  rarefied  fluid  flow,  such  as  that  which  occurs  in  the 
upper  atmosphere.  It  uses  Monte  Carlo  methods  and  simulates  a  representative  collection 
of molecules. A brief description of the application and its parallelization appears in the
SPLASH report[SWG92].

We simulate a system of 100,000 molecules. These molecules are statically assigned
to threads, but no attempt is made to assign nearby molecules to the same thread. Because
of this, the interactions of molecules are almost always with molecules assigned to other
threads, and since the molecules are all moving, the collection of interactions is changing
constantly. The net result is that there is little reuse of data.3

The  main  characteristics  of  mp3d  are: 

•  High  access  rates. 

•  Little  reuse  of  data. 

2.2.8  Barnes 

The barnes application simulates the gravitational interaction of a system of n
bodies. It uses the $O(n \log n)$ Barnes-Hut algorithm rather than the $O(n^2)$ direct pairwise
computation. A brief description of this application and its parallelization appears in the
SPLASH report[SWG92], and [SHG92] is a more thorough study.

3 Mp3d has since been rewritten at NASA-Ames using spatial decomposition techniques and has improved
locality of reference[LLJ+92]. Unfortunately this improved code has not been publicly released.

Application   Access Rate   Reuse of Data   Comment
-----------   -----------   -------------   -------
sieve         high          very high       regular access intervals
blkmat        low           low             varying intervals between accesses
sor           high          high            many barrier synchronizations
ugray         medium        medium          complex data structures
water         medium        medium          bursty traffic
locus         high          high            linear sequences of accesses
mp3d          high          low             changing data usage
barnes        medium        high            complex data structures

Table 2.2: Summary of Application Characteristics.


In this application, bodies are organized in a three dimensional hierarchical
structure called an octree. This allows aggregation of distant particles for computational
efficiency, but individual access to nearby particles for computational accuracy. This is a
well crafted implementation that assigns neighboring particles to the same thread, and thus
there is much reuse of data.

Hierarchical structures are used for both the organization of data and the
partitioning of work among the threads. Building tree-like structures cannot be completely
parallelized since there is little concurrency near the root of a tree. With large numbers of
threads, a substantial amount of time is spent waiting for synchronization events. This is
due both to incomplete parallelization of tree operations and to imperfect load balancing
among the threads.

The  main  characteristics  of  barnes  are: 

•  Moderate  access  rates. 

•  High  reuse  of  data. 

•  Many  long  synchronization  stalls. 


2.2.9  Summary  of  Application  Characteristics 

Table 2.2 summarizes the preceding application discussions in terms of the
applications’ access rates and reuse of data. These characteristics will affect the results of
the  experiments  presented  in  this  dissertation.  High  access  rates,  for  example,  will  require 
multithreading  to  use  many  threads  per  processor,  and  low  reuse  of  data  will  cause  caching 
to  perform  poorly. 

2.3 Simulation System


In  order  to  conduct  this  research  we  have  built  a  fast  and  accurate  simulator  called 
FAST  (for  Fast  Accurate  Simulation  Tool).  In  this  section  we  summarize  a  few  important 
details  and  then  discuss  the  usage  of  the  simulator  for  the  studies  conducted  as  part  of  this 
thesis.  Appendix  B  contains  a  detailed  discussion  of  the  simulator  and  the  techniques  and 
tradeoffs  chosen  in  its  design. 


2.3.1  The  Simulator 

The  simulator  is  based  on  the  technique  of  execution  driven  simulation.  This  is 
a  process  whereby  the  application  program  to  be  simulated  is  actually  directly  executed, 
but  it  has  been  modified  so  that  it  counts  its  own  execution  time  and  returns  control  to 
the  simulator  at  special  events,  such  as  shared  memory  references.  The  simulator  works 
by  executing  the  many  parallel  threads  for  small  periods  of  time,  and  then  scheduling  the 
resulting  events  so  that  they  are  all  simulated  in  a  correct  global  time  order. 
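
The global time ordering of events can be sketched with a priority queue. This is a deliberately simplified illustration and not the FAST implementation: it assumes one thread per processor, a fixed latency, and threads described only by the number of cycles they execute before each shared memory reference. All names are hypothetical.

```python
import heapq


def simulate(threads, latency):
    """Return shared references as (time, thread) in global time order.

    threads: dict mapping a thread name to a list of run-lengths, i.e. the
    cycles the thread executes before each of its shared references.
    """
    heap = []
    for name, runs in threads.items():
        if runs:
            heapq.heappush(heap, (runs[0], name, 0))
    order = []
    while heap:
        time, name, idx = heapq.heappop(heap)   # earliest pending event
        order.append((time, name))
        nxt = idx + 1
        if nxt < len(threads[name]):
            # The thread resumes when its reference returns, then runs
            # until it issues its next shared reference.
            resume = time + latency
            heapq.heappush(heap, (resume + threads[name][nxt], name, nxt))
    return order
```

Running each thread ahead only as far as its next shared reference is what allows the application code to execute directly at full speed between events.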

The  modifications  to  the  application  program  are  best  made  at  the  object  code 
level,  since  at  this  level  accurate  timing  can  be  determined  based  on  the  individual  assembly 
language instructions. All of our applications were compiled at optimization level “-O2”,
and  their  timing  results  are  based  on  this. 

The  simulator  accurately  models  the  timing  of  the  MIPS  R3000[Kan89]  pipeline, 
and  all  interactions  between  threads  are  accurately  ordered.  One  slight  inaccuracy  occurs 
for  simulations  using  caching  of  shared  data:  the  cache  interactions,  such  as  invalidations, 
are  done  instantaneously  rather  than  being  delayed  for  the  transit  time  for  the  invalidation 
messages  to  travel  from  the  directory  to  the  cache.  This  simplification  makes  the  cache 
simulator  much  more  efficient  and  easier  to  write,  but  means  that  data  gets  invalidated 
slightly  sooner  than  it  would  on  a  real  machine. 

Because of careful use of execution driven simulation techniques, our simulator
is approximately 50 times faster than comparable simulators such as Tango[DGH91] or
[O’K89]. The main advantage of this speed is that it allows us to run longer and larger
simulations (and thus more representative of large systems) than those of previous researchers.

Experiment: efficiency on an ideal machine

Application   Processors   Multithreading
-----------   ----------   --------------
sieve         1-1024       1
blkmat        1-1024       1
sor           1-1024       1
ugray         1-512        1
water         1-343        1
locus         1-128        1
mp3d          1-1024       1
barnes        1-512        1

• Latency = 0 cycles

Table 2.3: Experimental parameters for measuring the execution efficiencies
on an ideal machine.


2.3.2  Simulation  Constraints 

There  are  a  number  of  constraints  that  have  kept  us  from  running  simulations 
as  large  as  we  would  have  liked.  First,  some  of  the  applications  (water  and  locus)  had 
only  moderate  input  sizes  available.  Second,  despite  our  fast  simulator,  simulation  is  still 
time  consuming.  We  thus  restricted  problem  sizes  so  that  individual  simulations  completed 
within a few hours. Third, simulations of large parallel machines require a lot of space to
hold the state of the many simulated threads and caches. We were limited to the 128 megabytes
that were available on the largest of our simulation host machines.

Table  2.1  listed  the  input  sizes  that  were  used  for  each  of  the  applications.  In  order 
to  gauge  the  amount  of  parallelism  available,  we  have  simulated  the  applications  as  if  they 
were  executing  on  an  “ideal”  machine  that  had  0  latency  and  no  contention  on  accesses  to 
shared  memory.  Such  a  machine  would  be  impossible  to  build,  but  it  corresponds  to  an 
upper bound on achievable performance. Table 2.3 lists the experiment’s parameters and
Figure 2.7 shows the results.

Rather than show the standard speedup curves (speedup = execution time on 1
processor / execution time on P processors), we have plotted the efficiency vs. the number
of processors (efficiency = speedup / P). Efficiency is much like processor utilization. The
difference is that efficiency is directly related to performance, whereas utilization is simply
a metric of how busy the processors are. For example, processors might be busy spinning
or doing redundant work and thus not contributing to overall speedup. The advantage of
efficiency over speedup is that it has been normalized by the number of processors and
[...] the number of processors exceeds some limit, at which point it drops more quickly.

We are simulating fixed size problems, and thus as the number of processors is
increased, the work gets partitioned more finely. [...] efficiency declines at some point
because of various factors such as: uneven load balancing, synchronization overhead, and
[...] (such as at the root of a tree). Water stands out in Figure 2.7 because
of its jagged efficiency curve. This is the result of [...] when the
number of processors is incongruent to the number of molecules (343), as was explained in
Section 2.2.5.

Figure 2.7 was used to choose a “reasonable” limit on the number of threads
(amount of parallelism) that could be used for each application given the problem
sizes that we were able to simulate. These thread limits are listed in Table 2.4. [...]
various simulation experiments in this thesis [...] constrained the number
of threads and processors that were used.

Application   Threads
-----------   -------
sieve         [...]
blkmat        256
sor           256
ugray         256
water         120
locus          64
mp3d          256
barnes        128

Table 2.4: Limits on the number of threads [...] because of [...] to extract
[...] in fixed size problems.

These thread limits [...] the fixed problem sizes [...] exceed them. [...] on larger
parallel machines, larger problem sizes will be needed to supply additional parallelism.

2.3.3 Revised Machine Model

In Section 2.1.1 we selected a network latency of 200 cycles based on projections
of the latency [...] of a 1000 processor machine. We are unable to simulate [...] because of the
[...] constraints. We have therefore reduced the number of processors but [...] the
latency. Our revised machine model is shown in Figure 2.8. [...]

We justify this reduction in the number of processors but [...] the latency [...]
our results more directly applicable to a 1000 processor machine. One of the main
objectives of this thesis is evaluating multithreading as a [...] memory latencies of [...].
Multithreading [...] the behavior of the individual threads [...] determining the
relatively long intervals between references, and [...] when these intervals are fairly constant
[...] multithreading will work well. The intervals are primarily determined by the [...]

[Figure 2.8 diagram labels: Processors; C = (optional) Caches; M = Shared Memory]

Figure 2.8: Revised model of parallel machine with fewer processors but the
same latency as expected on a 1000 processor machine.

problems, we expect to see the same multithreading behavior on large machines as we
observe in our studies of the reduced machine model.

A final note on our simulations is that all results are based solely on the parallel
phase of computation. All of the applications studied in this dissertation have a sequential
initialization phase, a parallel computation phase, and a sequential termination phase.
It is common practice in simulation studies to report only the parallel phase. This is done
for a number of reasons. First, many of the applications are iterative, but only a small
number of iterations is simulated, thus artificially decreasing the duration of the parallel
phase and thereby increasing the significance of the sequential phases. Second, as problem
sizes are increased, the sequential phases become a smaller and smaller fraction of the total
[...]. Third, many of these applications were written for small
shared memory machines, and often much of the initialization could have been parallelized, but
this was not deemed necessary on a small machine. Finally, a large part of the
[...]

Chapter 3

Behavior of Multithreading

In this chapter we present [...] in our simulation studies.

3.1 Multithreading Model


[...] by Saavedra-Barrera, Culler, and von Eicken in [SBCvE90].
The model considers just a single processor of a parallel computer. We assume
that a thread on this processor repeatedly issues remote references at an interval of R cycles
(called the run-length), and that after issuing a remote reference the thread must wait for
the response to return before resuming execution. The response time depends on the
delay through the interconnection network [...], and for the model we assume
a fixed round-trip latency of L cycles.

[...] these two parameters R and L [...] be poorly utilized.
Figure 3.2 shows the same situation [...]

Figure 3.1: Model of a single thread. (R = Run-length, L = Latency)

Figure 3.2: Multithreading with 3 threads per processor. (C = Context switch cost)


3.1.1  Analysis  Under  Constant  Run-Lengths 

Under simple assumptions about R, L, and C, we can compute the processor
utilization as a function of the multithreading level M. The simplest assumption is that
R, L, and C are all constants. Under this assumption, the performance analysis can be
broken into two separate cases. The first case occurs when there are not enough threads
to hide the latency and the processor is sometimes idle. Figures 3.1 & 3.2 both show
this case. [...] The length of the period is R + L and the amount of work done is M · R.
Thus processor utilization is MR/(R + L). In this case performance increases linearly with
the number of threads. The second case occurs when there are enough threads to hide
the latency. In this case the only performance loss comes from context switch overhead,
and the processor utilization is R/(R + C). The boundary between these two cases occurs
when M(R + C) = R + L. Solving for M we get:

$$
\text{Processor Utilization} =
\begin{cases}
\dfrac{MR}{R+L} & \text{if } M < 1 + \dfrac{L-C}{R+C} \\[2ex]
\dfrac{R}{R+C} & \text{otherwise}
\end{cases}
$$

If C is small, we can approximate the number of threads needed to maximize
processor utilization as M = 1 + L/R.

Figure 3.3: Processor utilization as a function of multithreading for various
run-length distributions.

Rule 1 With a constant run-length, approximately 1 + L/R threads are
needed to keep the processor busy.
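
The two-case utilization expression for constant R, L, and C can be written directly as a function. This is a small sketch of the formula as stated in this section; the function name and argument order are illustrative.

```python
def utilization(M, R, L, C):
    """Processor utilization for M threads with constant run-length R,
    latency L, and context switch cost C (all in cycles)."""
    saturation = 1 + (L - C) / (R + C)   # threads needed to hide latency
    if M < saturation:
        return M * R / (R + L)           # linear region: processor idles
    return R / (R + C)                   # saturated: only switch overhead
```

For example, with R = 20, L = 200, and C = 0 the saturation point is 1 + L/R = 11 threads: five threads give a utilization of 100/220, or about 0.45, and eleven or more threads give 1.0.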


This function is shown graphically in Figure 3.3 [...] using the values
R = 20, L = 200, [...]. These values of R, L and C are similar to the values that we
expect for real applications and hardware.

3.1.2 More Complex Distributions

In real applications the run-lengths will rarely be predictable. A better model
is to choose the run-lengths from some random distribution. Figure 3.3 also shows the
utilization curves for three other run-length distributions: uniform, geometric, and bimodal.
These are all parameterized to have the same mean (R = 20). For the uniform
distribution, the run-lengths are chosen with equal probability over the range 1 to 39. For
the geometric distribution, the run-length comes from a sequence of biased coin flips where
at each step the probability of completion is P = 1/20. And for the bimodal distribution
the run-length is either R = 1 with 75% probability, or R = 77 with 25% probability.

For the geometric run-length distribution the model was solved by Saavedra-Barrera
and Culler[SBC91] using Markov chain analysis. However, for more general
distributions the analysis becomes [...]. We instead use numeric simulation to compute
processor utilization versus multithreading curves for any specified distribution. For the
uniform and bimodal distributions the curves in Figure 3.3 were calculated using this technique.
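
A numeric simulation of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the program used to produce Figure 3.3: it simulates a single processor with round robin scheduling, a fixed latency, and run-lengths drawn from a supplied sampler. The function names, the number of simulated references, and the random seed are all arbitrary choices.

```python
import random


def constant_run(rng):
    return 20


def uniform_run(rng):
    return rng.randint(1, 39)           # equal probability over 1..39


def geometric_run(rng):
    n = 1                               # biased coin flips, P = 1/20
    while rng.random() >= 1 / 20:
        n += 1
    return n


def bimodal_run(rng):
    return 1 if rng.random() < 0.75 else 77


def simulate_utilization(M, sampler, L=200, C=0, references=20000, seed=1):
    """Estimate processor utilization for M threads under round robin."""
    rng = random.Random(seed)
    ready = [0] * M                     # time at which each thread may run
    t = 0                               # current processor time
    busy = 0                            # cycles spent doing useful work
    for n in range(references):
        i = n % M                       # round robin choice of thread
        if ready[i] > t:
            t = ready[i]                # processor stalls until the reply
        r = sampler(rng)
        busy += r
        t += r
        ready[i] = t + L                # reference issued, reply after L
        t += C                          # context switch to the next thread
    return busy / t
```

With 11 threads the constant distribution keeps the simulated processor fully busy, while the random distributions fall short of full utilization at the same multithreading level.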

The histograms of these distribution functions are shown in Figure 3.4. These
histograms are drawn with piles whose size corresponds to the percentage of the total
run-lengths having a particular value. Piles that would overlap are combined to make a
taller pile. Refer to Appendix A for a more complete [...] of this style
of histogram. These histograms will be used as a basis for building intuition
into multithreading behavior.

By comparing the distributions and the plots of their performance, we can observe
that the distributions with the most short run-lengths need the highest multithreading
levels. This occurs even though the mean run-length remains the same (20) for all of
them. The short run-lengths cause problems when, by chance, several
threads have short run-lengths in succession, and the remaining threads are unable to hide
the latency. In these cases the processor will be forced to stall. In the opposite case, when
several successive threads all have long run-lengths, the latency is easily covered, but the
excess latency tolerance is wasted.


Rule 2 When run-lengths are random, the presence of short run-lengths increases
the multithreading level needed to keep the processor busy.
Figure 3.4: Histograms of distribution functions. The horizontal axis shows
the run-length. Each data point is represented by a pile whose size corresponds
to its percentage of the total, and overlapping piles combine together
to make taller piles.
• Latency = 200 cycles
• Context switch = 0 cycles
• Scheduling = round robin
• No shared memory caches

Table 3.1: Experimental parameters for measuring run-lengths under switch-on-load.

3.2 Applications' Run-Length Distributions

Figures 3.5 & 3.6 show the run-length distributions for our benchmark applications
under the switch-on-load multithreading model. These run-length distributions were
collected from simulations; the simulation assumptions will be discussed in the next
section. We present the run-length distributions first so that we can make predictions
based on what we have learned from the multithreading model.

The first four distributions [...] previous chapter: sieve is similar [...], with
a pile near 20, is similar to the uniform distribution, [...] to the bimodal
distribution, and ugray is similar to the geometric distribution.

Blkmat has long run-lengths (around 500 cycles). This occurs because it is
written so that it copies shared data values into local memory and then does its
calculations using only the local copies. It seldom accesses the original copies of
shared data, and because of this, it achieves the highest mean run-length (164 cycles).
Water (43) and barnes (42) also exhibit long mean run-lengths. Ugray (22) and sieve (19)
have more moderate run-lengths. And mp3d (9.5), locus (7), and sor (6) have short mean
run-lengths.

Based on the mean run-lengths and distributions, we can anticipate how these
applications will behave under multithreading. Because of their


Figure 3.5: Histograms of the run-lengths distributions of the applications
running under switch-on-load. Panels: sieve (mean = 18.9), blkmat (mean = 164.0),
sor (mean = 6.0), and ugray (mean = 22.2).

Figure 3.6: Histograms of the run-lengths distributions of the applications
running under switch-on-load.
Experiment: switch-on-load

Application   Processors   Multithreading
sieve             16           1-20
blkmat            32           1-20
sor                8           1-50
ugray              8           1-20
water             10           1-34
locus              2           1-40
mp3d               8           1-40
barnes            12           1-20

• Latency = 200 cycles
• Context switch = 0 cycles
• Scheduling = round robin
• No shared memory caches

Table 3.2: Experimental parameters for switch-on-load.


short  mean  run-lengths,  we  can  expect  sor,  locus,  and  mp3d  to  require  high  multithreading 
levels  to  keep  the  processor  busy.  Also,  ugray  and  water  will  require  extra  multithreading 
because  of  the  large  number  of  short  run-lengths  in  their  distributions. 


3.3  Testing  the  Multithreading  Model 

In this section we compare the performance predicted by the multithreading model
to the actual performance observed in simulations. The parameters of the simulation
experiments are shown in Table 3.2. We have assumed a 200 cycle remote access latency and
a context switch cost of 0 cycles¹.

Many of the applications will require a large multithreading level in order to reach
high execution efficiencies, but the total number of threads available is limited by the fixed
problem sizes that we are able to simulate (as discussed in Chapter 2). Therefore, for each
application, we have taken the multithreading level (M) needed in order to achieve high
efficiency, and selected the number of processors (P) so that P · M is approximately equal to
the thread limit. The results are presented here as if just a single set of experiments were
performed, but, in fact, preliminary experiments were also performed in order to determine
the multithreading levels needed by the applications.

For  some  applications,  such  as  locus,  the  resultant  number  of  processors  used  was 
quite  small.  In  later  experiments,  with  better  multithreading  models  that  require  fewer 
threads  per  processor,  we  will  increase  the  number  of  processors  used  in  our  simulations. 


¹ The zero cycle context switch is justified in Chapter 8.
Figures 3.7 & 3.8 show the predicted and observed performance of the applications
under  the  switch-on-load  multithreading  model.  The  predicted  performance  is  based 
on  the  multithreading  model  and  the  run-lengths  distributions  presented  in  the  previous 
section.  The  observed  performance  is  based  on  the  simulations. 

The multithreading model predicts processor utilization rather than our preferred
metric of execution efficiency, which we use for most of the results presented in this
dissertation. The model predicts processor utilization because the run-length distributions fed
into it reflect the entire parallel execution. Some of this execution may include extra
operations performed by the parallel program that are not performed by the sequential program.
These extra operations keep the processor busy, but they do not contribute to application
speedup. Figures 3.7 & 3.8 show both the processor utilizations and the execution
efficiencies observed in the simulations. For some applications (water, mp3d, and locus) the
processor utilizations and execution efficiencies are indistinguishable from each other. For the
others, the gap between utilization and efficiency arises because of the extra operations done
by the parallel programs. Locus and ugray, for example, both do dynamic job scheduling
and use spinning to wait for jobs to become available. This spinning keeps the processors
busy but does not perform useful computation. Sieve, blkmat, and sor also exhibit a gap
between processor utilization and efficiency. These applications do static partitioning of the
work among the threads. Each thread does the partitioning calculation, and thus with more
threads, more time is spent doing these partitioning calculations that are not needed by
the sequential programs. All of the applications actually have parallel overheads. They are
just much more visible for sieve, blkmat, and sor because these applications have shorter
execution times than the other applications.

For most of the applications, there is also a large gap between the processor
utilization predicted by the multithreading model and the processor utilization observed in
the simulations. This gap arises because the processors sometimes sit idle or underutilized
while threads wait on synchronization or because of imperfect load balancing. The
jaggedness in the processor utilization curves for sor and water is an indicator of this load
imbalance problem. Another inaccuracy of the multithreading model is that it assumes
that the run-lengths drawn from the distribution are mutually independent. In actuality,
the applications proceed through different phases of their computations; some phases have
short run-lengths, and some phases have long run-lengths. All of these reasons contribute
to the optimistic predictions of the multithreading model.


Figure 3.7: Predicted and observed performance for switch-on-load. Each panel
plots processor utilization (%) and shows three curves: processor utilization
predicted by the model, processor utilization observed in simulations, and
efficiency observed in simulations.

Figure 3.8: Predicted and observed performance for switch-on-load. The axes and
curves are the same as in Figure 3.7.
Sor has a very strong correlation between threads. The threads all [...] if the
threads were not so well synchronized [...]. [...] applications [...] other
multithreading models do not benefit from [...], and thus we do not pursue it further.

[...] partitions the computation into blocks (as described in Section 2.2.2) [...]
from a simulation of 32 processors at a multithreading level of 4, and
the simulations at smaller multithreading levels had longer run-lengths.

3.4 Conclusions

[...] behavior [...]. With constant run-lengths, M = 1 + L/R threads are needed to
hide the latency. When the run-lengths vary, higher multithreading levels are
required, particularly for those distributions with many short run-lengths.

For real applications, the situation is more complex. Unlike our mathematical
model of program behavior, real applications have varying behavior over time and their
threads are not independent. In sor, for example, there are phases of computation
and convergence checking. The convergence checking phases [...] and thus have shorter
average run-lengths than the computation phases. Also, [...] multithreaded systems [...].
Chapter 4


Multithreading  Without  Caching 

[...] models that do not provide caching of shared data. Their advantage is that
they avoid the cost and complexity of cache coherency. Their disadvantage [...]
switch [...] because it context switches [...]. The explicit-switch model solves this
problem by providing a mechanism [...] then issue the entire [...] messages into the
network before it context switches.

4.1 New Format

[...] multithreading level, as shown in Figure 4.1(a). We are mainly interested in
the performance that is obtained and the multithreading level needed to obtain it.
For [...] into [...] bar as shown in Figure 4.1(b). The height of the bar shows the
efficiency obtainable, and the number at the top [...] the multithreading level required.
The other lines indicate the efficiencies obtainable at [...] multithreading levels.
In this example the highest [...]. At this level the efficiency
Figure 4.1: New format for presenting multithreading efficiency results.
(a) Old format, plotting efficiency against multithreading level. (b) New format.


is  84.6%.  A  slightly  higher  efficiency  of  85.2%  is  achieved  at  M  =  22,  but  we  do  not  show  it 
because  such  a  minor  increase  in  performance  would  probably  not  be  worth  increasing  the 
multithreading  level.  Although  we  may  hope  that  applications  will  have  abundant  threads, 
for  many  applications  threads  will  be  a  limited  resource.  In  this  and  other  results  presented 
in  this  bar  graph  format,  we  report  the  highest  efficiency  up  to  the  point  where  the  efficiency 
increases  by  less  than  1%  per  additional  thread. 
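The reporting rule in the last sentence can be written down as a short selection procedure. The sketch below is one possible reading of that rule, not the procedure actually used to draw the figures: it treats "1%" as one percentage point of efficiency, and the function name and the sample data are hypothetical. Only the values 84.6% and 85.2% at M = 22 come from the text; the level assigned to 84.6% and the remaining points are made up for illustration.

```python
def reported_level(efficiency_by_level, min_gain_per_thread=1.0):
    """Pick the multithreading level to report from measured efficiencies.

    efficiency_by_level maps multithreading level -> efficiency in percent.
    Starting from the smallest level, move to a higher level only if it raises
    efficiency by at least min_gain_per_thread points per additional thread.
    """
    levels = sorted(efficiency_by_level)
    best = levels[0]
    for m in levels[1:]:
        gain = efficiency_by_level[m] - efficiency_by_level[best]
        if gain / (m - best) >= min_gain_per_thread:
            best = m
    return best, efficiency_by_level[best]


if __name__ == "__main__":
    # Hypothetical measurements; only 84.6 and the M = 22 point are from the text.
    measured = {1: 9.0, 4: 35.0, 8: 66.0, 12: 84.6, 22: 85.2}
    print(reported_level(measured))  # (12, 84.6)
```

In this example the step from 84.6% to 85.2% costs ten extra threads, far below the threshold, so the lower level is the one reported.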


4.2 Switch-On-Load

Figure 4.2 shows the switch-on-load multithreading efficiencies in the new
format. Many of the applications need large multithreading levels. Particularly high are
sor (M = 40), locus (M = 32), and mp3d (M = 29). Furthermore, even at high
multithreading levels, some applications achieve only moderate efficiencies: sor (59%)
and ugray (66%).


Figure 4.2: Multithreading levels and the efficiencies they achieve under
switch-on-load.


[...] the multithreading model [...] because of the short average [...] short
run-lengths in the distributions. With such high multithreading levels, large [...]
are required in order to provide sufficient parallelism. Also, the hardware [...]
multithreading levels will be large because of the large number of [...].
Decreasing the required multithreading level, however, has many benefits: [...]
can utilize the full set of processors, less hardware is needed to [...] sets, and
less application overhead is incurred when fewer threads are used.
4.3 Increasing the Run-Lengths: Explicit-Switch

The key to decreasing the multithreading level and increasing performance is to
increase the run-lengths. This involves both raising the average run-lengths and eliminating
short run-lengths. To do this, a thread must be allowed to issue more than one reference
into the network before being context switched. There are two multithreading models that
address this: switch-on-use and explicit-switch.

Under switch-on-use, the hardware issues remote loads into the network and
continues executing the same thread. It context switches only when the thread tries to use
a value that has not yet returned. If the compiler can arrange instructions so that several
remote loads are grouped together, the loads will all be issued into the network before the
thread tries to use any of the results and is forced to context switch. This will eliminate
excess context switching and thereby increase run-lengths.

Another way to allow issuing multiple loads before context switching is to provide
an explicit context switch instruction that the compiler can insert between the group of
loads and the later uses of the requested data. The effect is the same as under
switch-on-use. The difference is that the hardware may be a little simpler under explicit-switch
than under switch-on-use because it does not have to check the status of registers as they
are used. In this section we will explore the performance of the explicit-switch model. We
expect that the results for switch-on-use would be virtually identical. Relevant hardware
issues are discussed in Chapter 8.


4.3.1  Grouping  Within  Basic  Blocks 

The  inner  loop  of  the  sor  application  is  shown  in  Figure  4.3(a)  as  an  example. 
Without  grouping,  the  5  loads  are  issued  one  at  a  time,  with  a  context  switch  after  each 
one.  In  Figure  4.3(b)  the  code  has  been  reorganized  so  that  all  5  loads  are  grouped  together 
and  are  then  followed  by  a  single  context  switch  instruction.  Rather  than  having  four  short 
run-lengths  followed  by  one  long  run-length,  there  is  now  just  a  single  long  run-length. 
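The effect of grouping on run-lengths can be illustrated with a toy instruction stream. The sketch below is not the sor code from Figure 4.3: it assumes every instruction takes one cycle, uses a hypothetical loop body of 5 shared loads followed by 25 local operations, and all names are made up for the example.

```python
LOAD, OP, SWITCH = "load", "op", "switch"


def run_lengths(instructions, switch_on_load):
    """Split an instruction stream into run-lengths (in cycles).

    Every instruction is assumed to take one cycle. A run ends after a shared
    load when switch_on_load is True, or after an explicit SWITCH otherwise.
    Instructions after the last switch point are not counted as a run.
    """
    runs, current = [], 0
    for ins in instructions:
        current += 1
        ends_run = (ins == LOAD) if switch_on_load else (ins == SWITCH)
        if ends_run:
            runs.append(current)
            current = 0
    return runs


def loop_body(loads, ops, grouped):
    """One iteration: shared loads, then local work.

    If grouped, a single explicit switch separates the loads from their uses.
    """
    if grouped:
        return [LOAD] * loads + [SWITCH] + [OP] * ops
    return [LOAD] * loads + [OP] * ops


if __name__ == "__main__":
    print(run_lengths(loop_body(5, 25, grouped=False) * 3, switch_on_load=True))
    print(run_lengths(loop_body(5, 25, grouped=True) * 3, switch_on_load=False))
```

Under switch-on-load the steady-state pattern is four run-lengths of 1 cycle followed by one of 26 cycles; with grouping, each iteration contributes a single run-length of 31 cycles.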

A compiler designed for a multithreaded architecture will group shared loads
whenever possible. Since the compilers we have today do not do this grouping, we wrote a
post-processor which finds the basic blocks in an object file¹, does dependency analysis within the

¹ A basic block is a sequence of instructions without any branches into or out of it,
except at its beginning and end.


Figure 4.3: Inner loop of sor under switch-on-load and explicit-switch. (Diagram
labels: context switch points, data dependencies.)


[...] instructions so [...]. [...] We [...] assumption which [...]. Once the
shared load instructions have been percolated [...], the reorganizer inserts context
switch instructions as needed to separate the [...]. Despite the limited [...],
our code reorganization appears to work very well for basic [...].

[...] benchmark applications. The grouping factor is the [...]. [...]
successful, with locus, sieve, and blkmat having only marginal or [...].
Application   Grouping Factor   Mean Run-Length
sieve              1.00               19.3
blkmat             1.00              164.9
sor                4.65               29.2
ugray              1.29               26.4
water              4.76              206.6
locus              1.05                8.3
mp3d               2.28               22.6
barnes             1.68               71.5

Table 4.1: Grouping and mean run-lengths achieved for the applications
after reorganization of their basic blocks.


Experiment: run-lengths for explicit-switch

Application   Processors   Multithreading
sieve             16            12
blkmat            32             4
sor               16             8
ugray              8            12
water             20             6
locus              2            28
mp3d              16            14
barnes            16             8

• Latency = 200 cycles
• Context switch = 1 cycle
• Scheduling = round robin
• No shared memory caches

Table 4.2: Experimental parameters for measuring run-lengths for explicit-switch.


The new run-length distributions are shown in Figures 4.4 & 4.5; they were
obtained from the experiments listed in Table 4.2. These run-length distributions for
explicit-switch should be compared to the run-length distributions for switch-on-load
that were shown in Figures 3.5 & 3.6.

[...] mp3d [...] virtually all of the short run-lengths have been eliminated.
[...] short run-lengths, but they [...].

The other four applications show little change. Locus had [...]
grouping that eliminated the shortest (1 or 2 cycle) run-lengths, but this [...]
because these short run-lengths comprised only 4.5% of the total. The change in
Figure 4.4: Histograms of the run-lengths distributions of the applications
running under explicit-switch.

Figure 4.5: Histograms of the run-lengths distributions of the applications
running under explicit-switch.
Experiment: explicit-switch

Application   Processors   Multithreading
sieve             16           1-12
blkmat            32           1-4
sor               16           1-8
ugray              8           1-12
water             20           1-6
locus              2           1-28
mp3d              16           1-14
barnes            16           1-8

• Latency = 200 cycles
• Context switch = 1 cycle
• Scheduling = round robin
• No shared memory caches

Table 4.3: Experimental parameters for explicit-switch.


mean run-length from 7.0 under switch-on-load to 8.3 under explicit-switch can mainly
be  attributed  to  the  extra  cycle  in  each  run-length  from  the  added  switch  instruction.  This 
extra  cycle  is  overhead  and  diminishes  performance.  The  next  most  troubling  application 
is  ugray.  The  grouping  factor  was  only  1.29  and  there  are  still  many  short  run-lengths  of 
just  2,  3,  or  4  cycles.  These  short  run-lengths  will  hamper  the  efforts  of  multithreading. 
The  lack  of  grouping  for  sieve  and  blkmat  is  unimportant  since  these  applications  already 
had  well  behaved  run-length  distributions  and  moderate  or  long  mean  run-lengths. 
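The numbers in Table 4.1 can be cross-checked with a rough estimate. The sketch below assumes, as our own reading rather than a definition taken from the text, that the grouping factor g is the average number of shared loads issued per context switch. Under that assumption each explicit-switch run-length merges about g switch-on-load run-lengths and adds one cycle for the switch instruction, giving roughly g · r + 1. The switch-on-load means r are the rounded values quoted in Section 3.2; the function name is hypothetical.

```python
# Mean run-lengths under switch-on-load (rounded values quoted in Section 3.2).
SWITCH_ON_LOAD_MEAN = {"sieve": 19, "blkmat": 164, "sor": 6, "ugray": 22,
                       "water": 43, "locus": 7, "mp3d": 9.5, "barnes": 42}

# Grouping factor and measured mean run-length under explicit-switch (Table 4.1).
TABLE_4_1 = {"sieve": (1.00, 19.3), "blkmat": (1.00, 164.9), "sor": (4.65, 29.2),
             "ugray": (1.29, 26.4), "water": (4.76, 206.6), "locus": (1.05, 8.3),
             "mp3d": (2.28, 22.6), "barnes": (1.68, 71.5)}


def estimated_run_length(app, switch_cost=1):
    """Estimate g * r + switch_cost: g merged run-lengths plus the switch instruction."""
    grouping_factor, _ = TABLE_4_1[app]
    return grouping_factor * SWITCH_ON_LOAD_MEAN[app] + switch_cost


if __name__ == "__main__":
    for app, (_, measured) in TABLE_4_1.items():
        est = estimated_run_length(app)
        print(f"{app:7s} estimated {est:6.1f}  measured {measured:6.1f}")
```

For locus the estimate is 1.05 · 7 + 1 = 8.35, in line with the measured 8.3. The estimate lands within about one cycle of the measured mean for every application except ugray, where it is about three cycles too high.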

The  experiments  used  to  measure  explicit-switch  execution  efficiencies  are  listed 
in  Table  4.3.  There  is  now  a  context  switch  cost  of  1  cycle  because  of  the  added  context 
switch  instructions.  We  have  also  increased  the  number  of  processors  used  for  sor,  water, 
mp3d,  and  barnes.  Under  explicit-switch  they  use  lower  multithreading  levels  than  they 
did  under  switch-on-load,  and  thus  the  surplus  threads  were  used  to  increase  the  number 
of processors. It might seem odd to compare results from switch-on-load and
explicit-switch that use different numbers of processors. Nevertheless, it is reasonable because
there is very little difference in the results when using either the old or new processor numbers.
This  might  seem  more  obvious  if  we  recall  that  the  multithreading  behavior  depends  on 
the  run-length  distributions.  For  the  switch-on-load  and  explicit-switch  multithreading 
models  the  run-lengths  usually  do  not  depend  on  the  number  of  threads  used.  As  long  as 
the  number  of  threads  is  kept  within  the  limits  set  by  the  available  parallelism,  the  number 
of  processors  used  does  not  have  much  impact  on  the  efficiency  results  obtained.  We  have 
chosen  to  increase  the  number  of  processors  because  it  makes  the  simulations  more  similar 
to  the  way  that  applications  will  be  run  on  real  machines  (with  many  processors). 
Figure 4.6: Multithreading levels and the efficiencies they achieve under
explicit-switch (grouping within basic blocks). The bars in the foreground show
the results for explicit-switch, while the bars in the background show the
results for switch-on-load for comparison.
(a) Code Motion

    t1 = shared_x;
    if (t1 > xmax)
        xmax = t1;
    t2 = shared_y;
    if (t2 > ymax)
        ymax = t2;

(b) Speculative Loading

    if (shared_flag)
        sum += shared_x;

(c) Loop Unrolling

    while (i > 0)
    {
        sum += shared_x[i];
        i--;
    }

Figure 4.7: Example code fragments with potential for inter-block grouping.

for  these  applications  and  raise  their  performance  to  be  comparable  with  the  rest  of  the 
applications. 

4.3.2  Grouping  Beyond  Basic  Blocks 

In  the  previous  section,  our  code  reorganization  and  grouping  of  shared  loads  was 
done  only  within  basic  blocks.  Compiler  based  optimization  could  do  better  by  looking 
beyond  the  scope  of  a  single  basic  block. 

Figure 4.7 shows three simplified examples of situations taken from the ugray and
locus applications that would be amenable to inter-block grouping by a good optimizing
compiler. In these examples, shared variables are prefixed with “shared_”, and all other
variables are local. In example (a), the loading of shared_y can be moved upward past the
conditional test and grouped with the loading of shared_x. In example (b), the loading
of shared_x could be moved ahead of the if statement and grouped with the loading of
shared_flag. This is called a speculative load since it is done on the speculation that the
conditional test will be true and that the load will in fact be needed. In example (c), several
iterations of the loop could be unrolled and the exposed multiple loads from the shared_x
array could then be grouped.

Code  motion  and  loop  unrolling  are  standard  optimizations  for  a  good  optimizing 
compiler.  Speculative  loading,  however,  is  trickier.  It  might  be  the  case  that  the  conditional 
test  checks  the  boundary  conditions  of  an  array.  If  the  load  is  moved  before  the  boundary 
check,  it  might  access  off  the  end  of  the  array  and  cause  an  unwarranted  memory  trap. 
Rogers and Li [RL92] have proposed a simple mechanism for dealing with this problem by
adding  a  poison  bit  to  each  register  and  taking  a  trap  only  upon  the  use  of  a  poisoned 
register.  A  further  problem  arises  if  speculative  loads  are  used  indiscriminately.  If  many  of 
Figure 4.8: Multithreading levels and the efficiencies they achieve under
explicit-switch with estimated inter-block grouping. The bars in the foreground
show the results for explicit-switch with inter-block grouping, while
the bars in the background show the earlier results for explicit-switch without it.
same compiler unrolling technique could group these loads as well, but they were missed
by the cache. Thus for locus our experiment underestimated the potential for inter-block
grouping.

For  the  toy  applications  (sieve,  blkmat,  and  sor),  we  have  also  verified  that 
inter-block  grouping  is  possible.  In  sieve  this  would  involve  inter-procedural  analysis.  In 
blkmat  it  involves  a  complex  code  motion.  And  in  sor  it  involves  a  simple  loop  unrolling. 

Figure 4.8 shows that with the addition of inter-block grouping, all of the
applications can now obtain efficiencies near or above 80% using 10 threads or less per processor.
In particular, notice the dramatic improvement of locus because of the grouping made
possible by loop unrolling.

4.4  Conclusions 

In this chapter we have shown that multithreading is effective at hiding long
latencies to shared memory. The switch-on-load model performs poorly for applications
that access memory frequently, but the explicit-switch model solves this problem by
allowing the grouping of independent loads and thereby eliminates many extraneous context
switches. For most of our applications grouping within basic blocks is adequate, and for
the others there do exist inter-block grouping opportunities. Further research in compiler
optimization is needed to fully explore the grouping of accesses.

Simulation results indicate that a multithreading level of 10 threads per processor
is adequate for hiding a 200 cycle remote reference latency, and that we can expect
efficiencies of 80% or better from a multithreaded parallel machine. This machine provides
no hardware caching of shared data, and thus it does not have the complexity of providing
cache coherency. The one drawback, which is the subject of Chapter 6, is that all accesses to
shared data are sent across the interconnection network, and thus the network bandwidth
requirements will be high.




Chapter  5 


Multithreading With Caching
Experiment: caching

Application   Processors   Multithreading
sieve            128             1
blkmat            64             1
sor               16             1
ugray             32             1
water             29             1
locus             10             1
mp3d              32             1
barnes            32             1

• Latency = 200 cycles
• Each processor has a 64K byte cache with a 16 byte line size and 4 way set
  associativity.

Table 5.1: Experimental parameters for caching without multithreading.


comparison with other research; however, this may not be the most cost effective choice
because of its large hardware cost [ON90].

For simplicity, we continue to assume a latency of 200 cycles for all network
references. In reality, references causing coherency traffic would take longer than other references
because of the additional message(s) sent to maintain coherency. For example, a
straightforward implementation of invalidations would take two round-trip message times (four
messages): the request message from the processor to the memory, the invalidation
message from the memory to the invalidation site, the acknowledgment message back to the
memory, and finally the response message back to the processor. However, a smarter
implementation, such as the DASH protocol [LLG+90], can reduce this from four message times
to three. Furthermore, in their prototype implementation [LLJ+92] they found the extra
latency of a reference requiring coherency to be only 30% over that for a normal reference.
Our constant latency assumption is thus slightly optimistic.
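To put rough numbers on this, the sketch below scales the 200 cycle latency by the number of sequential message times. It rests on a simplifying assumption that is ours, not the text's: every message time costs the same, so a normal reference (request plus response) takes two message times. The function name is hypothetical.

```python
def latency_by_message_count(base_latency, messages, base_messages=2):
    """Scale a reference's latency by its number of sequential message times.

    Assumes (for illustration only) that every message time costs the same and
    that a normal reference needs base_messages of them (request + response).
    """
    return base_latency * messages / base_messages


if __name__ == "__main__":
    print(latency_by_message_count(200, 4))  # straightforward invalidation
    print(latency_by_message_count(200, 3))  # three-message scheme
    print(200 * 1.30)                        # 30% overhead reported for the prototype
```

Under this assumption a coherency reference would cost 400 cycles with the straightforward scheme and 300 cycles with three message times, while the 30% overhead reported for the prototype corresponds to about 260 cycles.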

Table 5.2 shows the simulation results. For most of the applications the miss rates
are just a few percent and caching performs well. The two exceptions are mp3d and blkmat.

Mp3d has low reuse of data², and its high miss rate is a result of this. It also has a
high access rate, and thus despite the presence of caches, it still sends a large number of
accesses into the network. Without multithreading, it achieves an execution efficiency of only
15%. Gupta et al. [GHG+91] obtained a processor utilization of 26% for this application in
their simulations of the DASH multiprocessor³. They assumed a latency of less than half

² See Section 2.2.7.

³ This value was calculated based on their results under release consistency, which is similar to our
assumption of weak consistency.
Application   Miss Rate   Efficiency
sieve            0.3%        89%
blkmat          42.5%        59%
sor              0.9%        67%
ugray            3.9%        63%
water            2.9%        73%
locus            1.8%        65%
mp3d            16.0%        15%
barnes           2.3%        78%

Table 5.2: Average miss rates and execution efficiencies on a machine with
64K byte caches, 200 cycle latency, but no multithreading.


of  what  we  did,  and  thus  our  lower  efficiency  is  to  be  expected. 

Blkmat also has a high miss rate (42.5%), but because it has a low access rate, the
resulting rate of network accesses is low enough to allow it to achieve 59% efficiency. Blkmat
has a low access rate because it was programmed to make local copies of shared data. These
local copies can be thought of as software caching, and thus the hardware cache is superfluous.

For the other applications, the efficiencies are in the 60% to 70% range. These
efficiencies are acceptable for large parallel machines. For instance, executing at 70%
efficiency on a 1000 processor machine would give a speedup of 700. We thus conclude that
multithreading is not essential when caching is provided.

Gupta et al. [GHG+91] obtained quite different results in their studies of cache
coherent multiprocessors. They looked at just three applications: mp3d, pthor, and lu.
These all have high miss rates and low execution efficiencies, as mp3d does in our studies.
In our larger application suite, mp3d is the exceptional case.

Most of our applications achieve acceptable execution efficiencies, but there is still
significant performance loss due to latency. Thus there is an opportunity for multithreading
to help push execution efficiencies higher. In the subsequent sections we look at the
performance improvements that can be obtained by using multithreading to hide the network
latency.
Application   Processors   Multithreading
sieve            128             1
blkmat            64             3
sor               16             4
ugray             32             3
water             29             3
locus             10             2
mp3d              32            11
barnes            32             2

• Latency = 200 cycles
• Context switch = 3 cycles if caused by a miss, 0 cycles if caused by the scheduling policy
• Scheduling = lock-priority + spin-switch
• Each processor has a 64K byte cache with a 16 byte line size and 4 way set associativity.

Table 5.3: Experimental parameters for measuring run-lengths under
switch-on-miss.


5.2  Run-Lengths  with  Caching 


Run-lengths are very different when there is a cache as compared to
when there is not. Without a cache, under explicit-switch, context switches occur at
rates ranging from once every 30 cycles to once every 300 cycles. However, with caches we
can now expect most of the previous context switches to be avoided and the mean
run-lengths between context switches to rise considerably. Rather than multithreading many
threads in order to hide each other's latency, we will need perhaps only two threads per
processor so that one can execute while the other is waiting on memory.

The experiments specified in Table 5.3 were used to measure the run-length
distributions of the applications. The multiple threads on a processor all share the cache,
and thus they may interfere with each other's cached data. The miss rates will thus be
higher under multithreaded execution than the miss rates listed in Table 5.2 for execution
without multithreading⁵. The differing miss rates imply differing run-lengths, and thus the
run-length distributions with caching will vary based on the level of multithreading and
[...]. We gathered the run-lengths at the levels that were
found to be appropriate for each application.

⁵ Section 7.3 studies the increase in miss rates from multithreading.

Figure 5.1: Histograms of the run-length distributions of the applications
running under switch-on-miss.

Figure 5.2: Histograms of the run-length distributions of the applications
running under switch-on-miss.

Figures 5.1 and 5.2 show the run-length distributions under switch-on-miss. The
mean run-lengths are now above 200 cycles (except for mp3d). If the run-length distributions
were constant, M = 2 would be sufficient to completely hide the 200 cycle latency, but
unfortunately this is not the case. There are still many short run-lengths where misses
occur on successive references. The net effect of the cache is that it raises the average
run-lengths and spreads out the run-length distributions. When long sequences of accesses
hit in the cache, there are run-lengths that last for thousands or even tens of thousands of
cycles.
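
The observation that M = 2 would suffice for constant run-lengths can be checked with a simple deterministic model, sketched here in Python. The model is an assumption made for illustration and is not the simulator used in this work: every thread runs for exactly `run_length` cycles, pays `switch_cost` cycles, and then waits `latency` cycles. The function and parameter names are hypothetical.

```python
def min_threads(run_length: int, latency: int, switch_cost: int = 0) -> int:
    """Smallest M that keeps the processor busy when run-lengths are constant."""
    busy_per_thread = run_length + switch_cost
    # Ceiling of (busy_per_thread + latency) / busy_per_thread.
    return -(-(busy_per_thread + latency) // busy_per_thread)


def efficiency(run_length: int, latency: int, threads: int,
               switch_cost: int = 0) -> float:
    """Fraction of cycles spent on useful work under the constant run-length model."""
    # With enough threads, only the switch cost is lost.
    saturated = run_length / (run_length + switch_cost)
    # With too few threads, the processor idles for part of each latency period.
    unsaturated = threads * run_length / (run_length + switch_cost + latency)
    return min(saturated, unsaturated)


print(min_threads(200, 200))    # 2
print(min_threads(20, 200))     # 11
print(efficiency(200, 200, 1))  # 0.5
```

Under these assumptions a 20 cycle run-length would need 11 threads, which suggests why the short run-lengths that remain matter even though the means exceed 200 cycles.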

5.2.1  Smarter  Scheduling 

The  disparity  in  run-lengths  suggests  that  a  simple  round-robin  scheduling  policy 
may  no  longer  be  the  best  choice.  Long  run-lengths  can  cause  problems  because  they  block 
out  other  threads  from  the  processor.  Consider  the  following  scenarios  with  two  threads  on 
a  processor: 

unbalanced scenario: Thread A is executing with long run-lengths taking thousands of
cycles, while thread B is executing short run-lengths of just 20 cycles. A good scheduling
policy should switch out thread A whenever thread B is ready to run. This allows
hiding the latency from as many of B's references as possible.

locking scenario: Thread A is executing with long run-lengths, while thread B is
attempting to obtain a lock, do a few critical operations, and release the lock. In order
to minimize contention for the critical region, it is important for B to hold the lock
for as short a time as possible. Ideally thread A should be switched out when thread
B is ready to run, giving B priority when it is holding a lock.

spinning scenario: If thread B is spinning while waiting for some event to happen, it
should be given lower priority so that thread A, which is doing useful work, can make
progress. In fact, it is essential that A be given access to the processor since B might
be waiting on an event that will be caused by A.⁶

⁶ Spinning is a bad idea on a multithreaded processor since the processor will usually have work that can
be done by another thread. In Section 7.1.2 we will discuss the implementation of synchronization primitives
that do not involve spinning.

These scenarios all suggest that context switching must be done more often than
just on cache misses. In fact, long run-lengths can be broken into several smaller and more
uniform run-lengths to help improve their latency hiding capacity. Below are a number
of basic scheduling policies that we studied alone and in combination with each other.
These policies all switch on cache misses, but also context switch for the additional reasons
specified by the policies.

Basic  Scheduling  Policies: 

spin-switch: Spinning threads are switched out after every shared memory load
instruction. This minimizes the number of execution cycles wasted by spinning threads.

timeout(N):  Threads  are  forced  to  switch  after  they  have  held  the  processor  for  N  cycles. 
(Tried  with  N  ranging  from  10  to  200.) 

lock-priority:  Threads  holding  a  lock  are  given  preemptive  priority.  This  allows  a  thread 
to  execute  and  exit  a  critical  region  as  quickly  as  possible. 

new-priority:  Newly  ready  threads  (those  having  just  received  a  result  from  a  remote 
reference)  are  given  preemptive  priority.  The  object  is  to  give  priority  to  those  threads 
that  are  executing  with  short  run-lengths. 

always-switch:  Threads  are  context  switched  after  every  shared  memory  load  instruction 
regardless  of  whether  it  missed  in  the  cache.  This  is  a  simple  policy  that  gives  all 
threads  frequent  access  to  the  processor. 
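
The policies above can be sketched as two small decisions: whether the running thread must give up the processor, and whether a ready thread should preempt it. The Python below is only an illustration under assumptions made here. The `Thread` fields, the function names and the rule that a lock holder preempts only a thread that holds no lock are hypothetical and are not taken from the simulator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Thread:
    name: str
    holds_lock: bool = False   # inside a critical region
    spinning: bool = False     # busy-waiting on some event
    newly_ready: bool = False  # a remote reference has just returned


def must_yield(thread: Thread, cycles_held: int, shared_load: bool,
               cache_miss: bool, spin_switch: bool = False,
               timeout: Optional[int] = None,
               always_switch: bool = False) -> bool:
    """Decide whether the running thread gives up the processor."""
    if cache_miss:
        return True  # every policy switches on a cache miss
    if always_switch and shared_load:
        return True  # always-switch: after every shared memory load
    if spin_switch and thread.spinning and shared_load:
        return True  # spin-switch: spinning threads yield after each load
    if timeout is not None and cycles_held >= timeout:
        return True  # timeout(N): forced switch after N cycles
    return False


def preempting_thread(running: Thread, ready: List[Thread],
                      lock_priority: bool = False,
                      new_priority: bool = False) -> Optional[Thread]:
    """Return a ready thread that should preempt `running`, or None."""
    for candidate in ready:
        if lock_priority and candidate.holds_lock and not running.holds_lock:
            return candidate  # lock-priority
        if new_priority and candidate.newly_ready:
            return candidate  # new-priority
    return None


a = Thread("A")
b = Thread("B", holds_lock=True)
print(must_yield(a, cycles_held=100, shared_load=False,
                 cache_miss=False, timeout=100))           # True
print(preempting_thread(a, [b], lock_priority=True).name)  # B
```

Combined policies, such as a timeout together with lock-priority and spin-switch, correspond to enabling several flags at once.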

Table 5.4 shows the execution efficiencies under some of the scheduling policies that
we studied. These simulations are for the switch-on-miss model, which will be discussed
in the next section. We present these scheduling results first because the best scheduling
policy found here will be used in the next section for the switch-on-miss simulations.
Experimental parameters are specified in Table 5.5.

Overall, the best policy that we studied was one that combined timeout(100),
lock-priority, and spin-switch. This was selected as best based on averaging the execution
efficiencies of all of the applications except sieve and mp3d. Sieve was excluded because
it runs well without multithreading, and thus the scheduling policy is irrelevant when there
is only one thread on a processor. Mp3d was excluded because we will see in Chapter 6 that
its performance will likely be constrained by bandwidth rather than latency.
                                                                                                     timeout(100)
                                          lock-priority  new-priority                                + lock-priority
Application  Multithreading  spin-switch  + spin-switch  + spin-switch  always-switch  timeout(100)  + spin-switch
blkmat       M=3             75.3         75.6           76.6           77.3           76.2          76.2
sor          M=4             79.7         79.7           84.4           89.6           88.8          88.8
ugray        M=3             73.6         81.9           89.3           88.3           89.4          89.8
water        M=3             89.1         92.8           92.8           92.1           93.4          93.5
locus        M=3             74.5         75.3           84.8           85.6           85.6          89.5
mp3d         M=11            84.5         84.5           83.7           92.4           84.6          84.6
barnes       M=2             80.0         80.9           82.0           82.5           82.3          82.4
Average (excluding mp3d)     78.7         81.0           85.0           85.9           86.0          86.7

Table 5.4: Execution efficiencies under various scheduling policies.


Experiment: scheduling under switch-on-miss

Application   Processors   Multithreading
blkmat            64             3
sor               16             4
ugray             32             3
water             29             3
locus             10             3
mp3d              32            11
barnes            32             2

• Latency = 200 cycles
• Context switch = 3 cycles if caused by a miss, 0 cycles if caused by the scheduling policy
• Scheduling = experimental parameter
• Each processor has a 64K byte cache with a 16 byte line size and 4 way set associativity.

Table 5.5: Experimental parameters for evaluating scheduling policies under
switch-on-miss.
Experiment: switch-on-miss

Application   Processors   Multithreading
sieve            128            1
blkmat            64            1-4
sor               16            1-4
ugray             32            1-4
water             29            1-3
locus             10            1-3
mp3d              32            1-11
barnes            32            1-3

• Latency = 200 cycles
• Context switch = 3 cycles if caused by a miss, 0 cycles if caused by the scheduling policy
• Scheduling = timeout(100) + lock-priority + spin-switch
• Each processor has a 64K byte cache with a 16 byte line size and 4 way set associativity.

Table 5.6: Experimental parameters for switch-on-miss.


Many other scheduling policies performed nearly as well as the chosen policy. In
fact, simple policies such as timeout(100) or always-switch performed within 1% of the
chosen policy on average. These policies address the three scenarios given above because
they limit the interval in which a thread can dominate the processor. We thus conclude
that choosing a particular scheduling policy is not critically important and can be based on
what the hardware designer finds most convenient.


5.3  Switch-On-Miss 

The  experimental  parameters  used  in  our  simulations  of  switch-on-miss  are 
shown  in  Table  5.6.  The  context  switch  cost  was  3  cycles  if  caused  by  a  cache  miss,  but 
0  cycles  if  forced  by  the  scheduler  because  of  some  scheduling  policy  related  decision  such 
as  a  preemption  or  timeout.  The  differing  context  switch  times  depend  upon  whether  the 
context  switch  decision  is  made  early  (scheduler)  or  late  (cache  miss)  in  the  pipeline.  This 
is  explained  in  Chapter  8. 

Figure 5.3 shows the execution efficiencies at various multithreading levels. The
bars with M = 1 are the results that were presented in Section 5.1 for caching without
multithreading. A few bars, such as M = 3 and M = 4 for blkmat, are unlabeled because there
was not sufficient room to insert the labels. In all cases, these unlabeled bars correspond
to the next sequential multithreading level.

At M = 1, most of the applications perform in the 60% to 70% efficiency range, and
the addition of multithreading raises the performance to the 80% to 90% range. Expressed
in terms of relative performance (multithreaded performance/single threaded performance),
multithreading provides a 30% to 40% performance increase for most applications. There
are three exceptions. Sieve caches extremely well and thus has no use for multithreading.
Barnes has a large performance loss due to synchronization, which is not helped by
multithreading, and mp3d caches poorly and thus has room for and achieves much larger
performance gains from multithreading.

The number of threads used for sor is small because of the sensitivity of its
performance to the degree of parallelization. It partitions the 192 by 192 grid into as many
square (or rectangular) regions as there are threads. Cache interactions occur just along
the edges of these regions because the algorithm accesses only neighboring values in the
grid. The cache hit rate is thus strongly affected by the size of the regions. To allow
a fair comparison between switch-on-miss and explicit-switch, we kept the number of
processors the same. This lets switch-on-miss receive the benefit of requiring fewer threads
and thus having larger regions for a given problem size. In the configuration used here (P
= 16, M = 4), sor runs at 89% efficiency. With more processors and threads (P = 64, M
= 4), and thus finer partitioning, efficiency drops to 75%.

Compared  to  the  results  for  multithreading  without  caching  (from  Chapter  4), 
the  execution  efficiencies  vary  from  a  few  percent  worse  to  15%  better,  depending  on  the 
application.  The  big  change  is  that  since  run-lengths  are  much  longer  with  caching,  not  as 
many  threads  are  needed,  and  the  improvement  due  to  multithreading  is  much  less.  For 
most  of  the  applications,  multithreading  of  3  threads  per  processor  is  adequate  to  hide  the 
200  cycle  latency. 

5.4  Conditional-Switch 

Grouping was very effective at improving the performance and decreasing the
multithreading levels needed under explicit-switch compared to switch-on-load. We can
apply the same idea to a caching system by treating the switch instructions conditionally.
Under the conditional-switch model, if all of the references preceding a switch instruction
hit in the cache, the switch instruction is ignored, but if any of them miss, then the switch
is taken in order to wait for the result(s). The potential benefit is that we can issue more
than one reference per thread into the network before waiting for the results to return.
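
The rule for a conditional switch instruction can be stated in a few lines of Python. This is a minimal sketch of the decision only. The function name and the representation of a group as a list of hit flags are assumptions made here for illustration.

```python
from typing import Iterable, Tuple


def conditional_switch(group_hits: Iterable[bool]) -> Tuple[bool, int]:
    """Return (switch_taken, outstanding_misses) for one group of references.

    `group_hits` holds one flag per reference issued since the previous
    switch instruction: True for a cache hit, False for a miss.
    """
    misses = sum(1 for hit in group_hits if not hit)
    # The switch instruction is ignored only when every reference hit.
    return misses > 0, misses


print(conditional_switch([True, True, True]))    # (False, 0)
print(conditional_switch([True, False, False]))  # (True, 2)
```

In the second call, two references are outstanding in the network when the thread switches out, which is the case in which grouping could pay off.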


• Latency = 200 cycles
• Context switch = [...]
• Scheduling = timeout(100) + lock-priority + spin-switch
• Each processor has a 64K byte cache with a 16 byte line size and 4 way set associativity.

Table 5.7: Experimental parameters for conditional-switch.

The experimental parameters for conditional-switch are given in Table 5.7.
They are the same as those for switch-on-miss [...]. The context switch cost varies
from [...] to 3 cycles depending on when the cache miss occurs relative to the context switch
instruction. If the cache miss has already occurred when the switch instruction enters
the pipeline, the context switch can be done immediately, as it was for explicit-switch.
Otherwise, the context switch will occur deeper in the pipeline, as it does in switch-on-miss.

Figure 5.4 shows the performance of the applications under conditional-switch
multithreading. These simulations were run using grouping within basic blocks, since
we do not have a compiler that can do inter-block grouping. All of
the applications have equivalent or lower performance than under switch-on-miss, except
for mp3d. Mp3d is an exception because it does not cache well and retains some of the
behavior of an uncached system.

The lower performance indicates that grouping is not useful in conjunction with
caching. This occurs because grouping is [...] more than one reference is
sent into the network before a context switch. In a group of references when the
cache is working well, usually all or most of the references hit in the cache, [...]
against the extra cost of the added
context switch instructions. These extra instructions take a cycle in the execution stream
regardless of whether they are useful or not.

5.5  Conclusions 

Caching  is  effective  for  most  of  our  applications.  We  observed  miss  rates  ranging 
from  1%  to  4%.  These  low  miss  rates  mean  that  threads  execute  for  longer  intervals  before 
context  switching  and  thus  fewer  threads  will  be  needed  to  hide  the  latency. 

However,  sometimes  these  long  execution  intervals  can  cause  performance  problems 
by  letting  one  thread  hold  the  processor  and  thereby  block  other  threads  from  executing. 
This  can  be  dealt  with  by  adding  a  timeout  or  other  mechanism  to  the  scheduling  policy. 

Our simulations show that a machine without multithreading can obtain efficiencies
of 60% to 70% with a latency of 200 cycles, and that a machine with switch-on-miss
multithreading using 3 threads per processor can boost these efficiencies to 80% to 90%.

Finally,  our  simulations  of  the  conditional-switch  model  show  that  grouping  is 
not  beneficial  in  conjunction  with  caching. 
Chapter 6

Limited  Bandwidth 


In the previous chapters we have shown that the long latencies of the communication
network can be tolerated by using multithreading techniques. In this chapter we look
at the other main characteristic of a communication network: bandwidth.

Whereas long latencies are inevitable because of the large number of processors
and memories that must be connected together, the bandwidth capacity of a network can
be increased by spending more money and adding more wires and/or switches. Unfortunately,
as machines grow, the network becomes a larger and larger fraction of the total
system hardware. For example, on indirect networks such as butterflies and fat-trees [Lei85],
$O(p \log p)$ routing nodes are used to connect $p$ processors. For direct networks, the number
of routing nodes is the same as the number of processors, but if you count pins and wires,
the amount of hardware increases for direct networks as well. On a hypercube, the degree
of the routing nodes increases as $O(\log p)$. On a 2-D mesh, the width of the channels must
grow as $O(\sqrt{p})$ if a fixed bisection bandwidth/processor is to be maintained. The bottom
line is that for a large machine, the network will be expensive, and therefore we need to
understand and minimize the bandwidth demands put upon it.
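
The growth rates for direct networks can be made concrete with a short calculation. The Python sketch below rests on assumptions made here for illustration: the mesh is a square $\sqrt{p} \times \sqrt{p}$ grid whose bisection cuts $\sqrt{p}$ channels, "bisection bandwidth/processor" means total bisection bandwidth divided by $p$, and all constant factors are taken as 1. The function names are hypothetical.

```python
import math


def hypercube_degree(p: int) -> int:
    """Links per routing node in a hypercube of p nodes (p a power of two)."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    return p.bit_length() - 1  # log2(p)


def mesh_channel_width(p: int, bits_per_cycle_per_proc: float) -> float:
    """Channel width needed on a sqrt(p) x sqrt(p) mesh.

    Assumes the bisection cuts sqrt(p) channels and must carry
    p * bits_per_cycle_per_proc bits per cycle in total.
    """
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("p must be a perfect square")
    return p * bits_per_cycle_per_proc / side


print(hypercube_degree(1024))         # 10
print(mesh_channel_width(1024, 2.0))  # 64.0
print(mesh_channel_width(4096, 2.0))  # 128.0
```

Quadrupling the number of processors doubles the required channel width, which is the $O(\sqrt{p})$ growth stated above.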

In this chapter we present the bandwidth needs of our benchmark application
suite under the explicit-switch and switch-on-miss multithreading models. Our results
will show that caching substantially reduces the network bandwidth needed. We then
look more closely at the traffic patterns of switch-on-miss systems. The traffic on these
systems will be bursty and thus some execution periods will need more network bandwidth
than others. We measure this burstiness and use the results along with a performance model
to suggest the level of bandwidth that should be supplied by the network.
Experiment: remote memory bandwidth needs of explicit-switch

Application   Processors   Multithreading
sieve             32             5
blkmat            64             3
sor               16           [...]
ugray             12            10
water             20             5
locus              8             7
mp3d              32            11
barnes            16             5

• Latency = 200 cycles
• Context switch = 1 cycle
• Scheduling = round robin
• Inter-block grouping estimates as in Section Section-IBG¹

Table 6.1: Experimental parameters for measuring the remote memory bandwidth
needs of the applications under explicit-switch.


6.1  Bandwidth  Requirement 

The  bandwidth  which  an  application  uses  depends  upon  a  number  of  factors.  First, 
the  application  may  be  either  computationally  or  communication  intensive.  Second,  if  the 
machine  provides  caching,  much  of  the  potential  traffic  may  get  filtered  out  by  the  cache. 
And  third,  if  the  processor  is  multithreaded,  the  higher  processor  utilization  and  thus  higher 
computational  rate  will  increase  the  bandwidth  requirement. 

6.1.1  Bandwidth  Requirement  Without  Caching 

We  measured  the  bandwidth  requirements  of  the  applications  by  summing  the 
sizes  of  all  messages  sent  through  the  network.  This  gave  us  the  total  traffic  used  by  an 
application.  We  then  normalized  this  to  bits/cycle/processor  by  dividing  the  total  traffic 
by  the  execution  time  and  by  the  number  of  processors.  We  call  this  the  remote  memory 
bandwidth. 
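
The normalization just described amounts to a single division. The Python below is a minimal sketch of it. The function name and the example figures (1000 messages of 64 bits, 16 processors, 2000 cycles) are hypothetical and are not measurements from this work.

```python
from typing import Iterable


def remote_memory_bandwidth(message_bits: Iterable[int], cycles: int,
                            processors: int) -> float:
    """Total traffic normalized to bits/cycle/processor."""
    total_bits = sum(message_bits)  # sizes of all messages sent
    return total_bits / (cycles * processors)


# Hypothetical run: 1000 messages of 64 bits on 16 processors in 2000 cycles.
print(remote_memory_bandwidth([64] * 1000, cycles=2000, processors=16))  # 2.0
```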

Table 6.1 lists the simulation parameters. We measured the bandwidths of
explicit-switch with inter-block grouping, which was the best performing multithreading
model (without caching). The bandwidths were computed based on the message sizes
shown in Figure 6.1. These messages are used for sending loads and stores and for returning
their results. The first field in a message is its destination memory module or processor.

¹ For the inter-block grouping estimates we used a one line cache for each thread. This affected the
grouping but not the bandwidth results. The bandwidth was calculated as if all messages were sent into the
network and no caching was present.
bits:  0        32        64        96        128

load word
  P → M:  mem | tag | op | 32 bit addr
  P ← M:  proc | tag | op | 32 bit data
load double
  P → M:  mem | tag | op | 32 bit addr
  P ← M:  proc | tag | op | 64 bit data
store word
  P → M:  mem | tag | op | 32 bit addr | 32 bit data
  P ← M:  proc | tag | op (ack)
store double
  P → M:  mem | tag | op | 32 bit addr | 64 bit data
  P ← M:  proc | tag | op (ack)

Figure 6.1: Message sizes for remote references to shared memory.

Next is an 8 bit tag field that is used to identify results as they are returned². Then comes an
8 bit opcode that specifies the operation type and message size. The last field(s) is either
the address being referenced, the data returned, or the address and data for a write.
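
The sizes implied by this layout can be tallied with a short Python sketch. The 8 bit tag, 8 bit opcode and 32 bit address come from the text. The 16 bit destination field is an assumption made here, chosen only so that the header is 32 bits and the largest message fits the 128 bit scale of Figure 6.1. All names are hypothetical.

```python
DEST_BITS = 16  # assumption: the destination field width is not stated in the text
TAG_BITS = 8
OP_BITS = 8
ADDR_BITS = 32
HEADER_BITS = DEST_BITS + TAG_BITS + OP_BITS


def message_bits(payload_bits: int) -> int:
    """Size of one message: header plus address and/or data payload."""
    return HEADER_BITS + payload_bits


# Request plus reply for each operation type.
ROUND_TRIP_BITS = {
    "load word": message_bits(ADDR_BITS) + message_bits(32),
    "load double": message_bits(ADDR_BITS) + message_bits(64),
    "store word": message_bits(ADDR_BITS + 32) + message_bits(0),
    "store double": message_bits(ADDR_BITS + 64) + message_bits(0),
}

for operation, bits in ROUND_TRIP_BITS.items():
    print(operation, bits)
```

With a different destination field width, every message grows or shrinks by the same number of bits, so the totals scale accordingly.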

These message sizes are at the small end of the spectrum of possible implementations.
For instance, we have assumed that the only routing information needed is the
number of the destination memory bank or processor, and that the return address can be
generated as the message is routed [GGK+82]. Also we have used 32 bit addresses, whereas
a large parallel machine will likely support a larger address space. To apply our simulation
results to a machine using larger messages, our bandwidth results should be scaled up
proportionally to the increase in message sizes.

Table 6.2 shows the remote memory bandwidth results. The bandwidths vary
considerably by application, and range from as high as 30 bits/cycle/proc to as low as 1.44
bits/cycle/proc. For comparison, Table 6.3 shows the bisection bandwidths of proposed
and existing machines. These bandwidths are for machines scaled to 1024 processors and
are taken from Figure 1.4 in Chapter 1.

At  first  glance,  our  measurements  of  remote  memory  bandwidth  in  Table  6.2  may 


² See Section 8.1.2.


Application    Remote Memory Bandwidth
               (bits/cycle/proc)
sieve                 9.80
blkmat                1.44
sor                  30.20
ugray                 6.16
water                 3.59
locus                15.07
mp3d                 19.91
barnes                3.08

Table 6.2: Average remote memory bandwidth needs of applications under
explicit-switch multithreading.


Machine    Bisection Bandwidth (P = 1024)
           (bits/op/proc)
TERA              55.0
CM-5               2.5
DASH               1.8
KSR1               1.6

Table 6.3: Bisection bandwidths of proposed and existing machines if scaled
to 1024 processors.

Figure 6.2: A 2-D mesh network and its bisection.
not seem directly comparable to the bisection bandwidths in Table 6.3. The remote memory
bandwidth denotes the total network bandwidth used by the [...], whereas bisection
bandwidth denotes the bandwidth between two halves of the network. The fraction of
bisection bandwidth used depends upon the network and traffic patterns. For example,
Figure 6.2 shows a 2-D mesh network with a dashed line drawn across its bisection. If data
is laid out so that most traffic is to nearby [...].
For some applications with regular communication patterns, [...] layouts are
possible if the network topology matches the communication pattern. However, for many
applications, the communication patterns and data [...]
random data layouts. With
random data layout, half of the traffic [...] a bisection
bandwidth of X would be [...].
Therefore, the comparison between [...]
is closer than this factor of two because networks do not achieve their peak bandwidth
capacity.

The bandwidth needs of the applications (as high
as 30 [...]) [...] bits/operation/proc, from [...] conclude that these networks will
be inadequate for an explicit-switch [...] system. The only network which can
handle these bandwidths is the one [...] perhaps more bandwidth than is
actually needed, but was purposely designed so as not to be bandwidth limited [Smi92].
For the other networks there must be some mechanism for reducing the bandwidth
requirement.
[Message format diagram. Legible entries: a (proc | tag | op (ack)) acknowledgment,
(mem | tag | op | 32 bit addr) and (proc | tag | op | 32 bit addr) messages, and a
cache writeback message.]

Figure 6.3: Messages used to support coherent caching.

6.1.2  Bandwidth  Requirement  With  Caching 

In Chapter 5 most applications showed high cache hit rates and thus caching should
be effective at reducing the network bandwidth requirements. The bandwidth reductions,
however, will not be as high as the cache hit rates [...] sizes, and
[...] more to transmit the [...] memory
access [...] without caching, [...] maintain
cache coherency. [...] we have assumed for [...]
(Figure 6.3) [...] similar to the messages for a non-caching system
(Figure 6.1), but now the memory [...] an entire cache line of data rather than just a
[...] and recall
messages, that are used to maintain cache coherency.

Table  6.4  shows  the  simulation  parameters,  and  Table  6.5  shows  the  bandwidth 
Experiment: [...] switch-on-miss

Applications: [...] blkmat, sor, ugray, water, locus, mp3d, barnes

• Latency = 200 cycles
• Context switch = [...] cycles
• Scheduling = timeout(100) + lock-priority + spin-switch
• Each processor has a 64K byte cache with a 16 byte line size and 4 way set associativity.

Table 6.4: Experimental parameters for measuring bandwidth under
switch-on-miss.


Application    Bandwidth
               (bits/cycle/proc)
sieve              [...]
blkmat              0.91
sor                 1.06
ugray               1.09
water               0.50
locus               1.97
mp3d               14.61
barnes              0.18

Table 6.5: Average remote memory bandwidth needs of applications under
switch-on-miss.

results measured for switch-on-miss. For all of the applications except mp3d, the
bandwidth needed is below 2.0 bits/cycle/proc, compared with as much as 30 bits/cycle/proc
without caching (see Table 6.2). [...] network bandwidth [...]
probably not be provided [...].

6  2  SqUeeZin8  Through  a  L.  • 

Z‘7ri  beCa“M  ‘-fflc  wfflT 7’  giVe  M  ZTd‘h  °f  tie 

PPhcatjons  pass  though  e  bursty  rather  ^  He  demands  on  tle 

con.pnta.ion  or  counBni  ™  ““Potationai  phases,  some”*  T ^  °VW  «”*>• 

eWaIy  *<*-  the  collection  T  °'herS'  F“Hermore  the  ^  d°  »“'«  or  less 

Sabiect  ' °  Wsher  usage  ,ha„  "T^  “°dUfe-  &me  mem  “"“'“'''on  *■  not  spread 
s“«Ie  Shared  variable,  or  simp,”  /”  ^  Spot>  because  of  7  m°d,l,eS  to  be 

should  h  The  MtWOli  <*-  a!  ‘°  ra“d0m  C°iMW«ce.  ^  ^  — es  to  a 

bn,  wffl  I";”0"8'1  b^«m  toSi77e7CUae  "m  “«"«*  be  a  co 

be  a^e  satisfy  ai]  th  demands  of  most  .  mPremjSe.  ft 

b^i,  the  network  will  .  aPP"Clti°"s  »«  He  time  D„  °”  “°S‘  °f  He  time 

:7 *  stjneeze  through  ^  P"fe»“»  of  the  ma  ^  T* 

UfficKnt  bandwidth  that  th  ',e‘"'°rk-  A  Properly  ch  *  me  «  which 

He  overall  performance  loss'  *°*  pJ£ 

order  to  de,  'h'  f0"0’’i“*  “““ns  we  ,ate 

Z7W  UMW  WdtareV^  ^  h 

°"*  -pects  of  travel  ,  Ci‘,0nS' 

P  erns>  Particularly  hot  « 

y  hot  spot  references  to  shared  vari- 
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•  Latency  =  200  cycles 
Context  switch  =  3  cycles  if  caused  by  a 
mjss,  0  cycles  if  caused  by  the  scheduling 
policy 

Scheduling  =  timeout(lOO)  +  Jock- 
priority  +  spin-switch 
Each  processor  has  a  64K  byte  cache 
with  a  16  byte  line  size  and  4  way  set 
associativity. 


Table  6.6:  Experimental  parameters  for  bursty  traffic  under  switch-on- 
miss. 


Application  Problem Size                      Simulation Length (cycles)  Processor Utilization  Cache Miss Rate
blkmat       320 x 320 matrices                1.8 M                       93%                    35%
sor          768 x 768 grid                    first 20 M                  94%                    1%
ugray        gears, 160 x 512 slice of image   first 20 M                  88%                    12%
barnes       16,384 bodies                     first 20 M                  72%                    15%

Table  6.7:  Increased  problem  sizes  of  applications. 


ables, will be more pronounced in larger systems with more processors. For this reason we have simulated systems as large as possible. The simulations were for 256 processors and are specified in Table 6.6. The problem sizes were increased so as to provide enough work to allow adequate parallelism.

We selected four of the eight benchmarks for the studies in the remainder of this chapter: blkmat, sor, ugray, and barnes. Two others could not be used because we did not have large enough inputs for these applications. We rejected mp3d because it caches poorly and thus is incompatible with a switch-on-miss parallel machine, and we rejected sieve because its bandwidth usage is so low that it is not interesting for this study.

Unfortunately, these larger problems (except blkmat) took far too long to allow executing them to completion. Thus sor, ugray, and barnes were executed only for the first 20 million cycles. We list the processor utilizations over this period since the execution efficiencies cannot be computed unless the applications are run to completion. For ugray and barnes the cache miss rates have increased because of the larger problem sizes.


[Figure 6.4 diagram labels: BW profile; BW constraint; slowdown of high BW phase]

Figure 6.4: High bandwidth phases will slow down as they squeeze through the bandwidth constrained network.

6.2.1  Remote  Memory  Bandwidth 

In  this  section  we  look  at  how  the  total  remote  memory  bandwidth  needs  of  the 
applications  vary  over  time.  We  will  predict  the  performance  that  will  be  achieved  on 
machines  with  limited  bandwidth  networks  by  developing  a  performance  model  and  then 
applying  it  to  the  bandwidth  needs  of  the  applications. 


Squeeze  Performance  Model 

Figure 6.4 shows the basic idea of our performance model. We start with a bandwidth profile of an application. This is obtained from simulations, and it shows the varying bandwidth needs of the application as a function of time. For most applications the bandwidth will not be uniform. Instead, the applications will have different phases with different bandwidth needs as shown in the figure. The network of a real machine will have a maximum bandwidth capacity, which is represented in the figure by the pipe labeled BW constraint. For our performance model, we assume that during phases when an application needs less bandwidth than is available, it will execute at full speed. But during phases when the bandwidth needs exceed the network bandwidth capacity, we assume that execution slows down and makes progress at the rate at which messages squeeze through the network.

Figure  6.5  formally  specifies  our  performance  model.  This  model  is  much  more 
accurate  than  simply  looking  at  the  average  bandwidth  over  the  entire  run  of  the  execution 
but  it  is  still  optimistic.  Under  some  adverse  traffic  patterns  there  may  be  some  links  of 
the  network  or  memory  modules  that  are  more  heavily  used  than  others.  These  will  be 
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Squeeze  Performance  Model 
\[
\text{slowdown} = \frac{\sum_{i=1}^{n} t_i \cdot \max\left(1, \frac{bw_i}{bw_{net}}\right)}{\sum_{i=1}^{n} t_i}
\]

n = number of phases
t_i = duration of phase i
bw_i = bandwidth needed in phase i
bw_net = bandwidth available


Figure 6.5: Performance model of an application having phases with varying bandwidth needs being executed on a machine with limited network bandwidth.


bottlenecks and could further slow down the execution of the machine. Hopefully, such bottlenecks will be rare when data is spread randomly across the machine as we have assumed.
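
The squeeze model described above can be sketched in a few lines of Python. This is only an illustration under the stated assumptions: a phase runs at full speed when its bandwidth need is below the network capacity, and is stretched in proportion to the excess otherwise. The function name and the example phase values are hypothetical and are not taken from the simulations.

```python
def squeeze_slowdown(phases, bw_net):
    """Estimate the slowdown of an application on a bandwidth-limited network.

    phases: list of (duration, bandwidth_needed) pairs, one per phase.
    bw_net: bandwidth available from the network (same units as the phases).
    """
    if bw_net <= 0:
        raise ValueError("bw_net must be positive")
    unconstrained_time = sum(duration for duration, _ in phases)
    # A phase needing more than bw_net is stretched by the ratio bw / bw_net;
    # a phase needing less runs at full speed (factor 1).
    constrained_time = sum(
        duration * max(1.0, bw / bw_net) for duration, bw in phases
    )
    return constrained_time / unconstrained_time


# Hypothetical profile: 85% of the time at 0.5 bits/cycle/proc,
# 15% of the time at 4 bits/cycle/proc, on a 1 bit/cycle/proc network.
example = squeeze_slowdown([(85, 0.5), (15, 4.0)], bw_net=1.0)
```

In this made-up profile the high bandwidth phase is slowed by a factor of four while the rest is unimpeded, so the overall slowdown is (85 + 60) / 100.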


Simulation  Results 

In  practice,  applications  do  not  exhibit  long  uniform  phases  as  suggested  by  the 
squeeze  performance  model.  The  processors  are  all  semi-independent  systems  which  issue 
occasional  messages  into  the  network.  Together  they  form  a  very  bursty  system.  At  some 
particular  point  in  time,  there  might  be  a  large  burst  of  new  messages  resulting  from  random 
coincidence. However, this burst will not slow down the machine if on subsequent cycles there is a compensating lull in new traffic. On a small time scale the network and its buffering serve to smooth out the traffic.

To take into account this natural smoothing of the traffic, we have gathered our simulation data over intervals of 100 cycles. Much shorter sample intervals would be pessimistic since they would report bursts of traffic that could be smoothed out by a real network, and likewise much longer sample intervals would be optimistic since they would smooth over long bursts of traffic that on a real network would have a performance impact.

We chose the value of 100 cycles because it is half of the expected 200 cycle latency.

This  .atency  will  consist  of  both  physical  de,ays  and 

of  the  latency  due  to  each.  The  congestion  delay  is  caused  by  the  jostling  of  aPS 

:zzr ush  ,he  ~  - is  - — -  -  -  ^  ;:r: 

— -  - 

for  20,000,000  cycles.  The  vertical  bars  in  th  ’  ,  PPlrcat.ons  were  simulated 

-  >00  cycles.  Ouring  each  ~  'JT  ^ 

ZrZxts!  sample  Zf  “t  ^  ^  ^  ^ 

sample  values  were  then  normalized  to  bits/cycle/proc. 

These  bandwidth  profiles  graphically  show  both  the  short  term  A  . 

z;::::;  rr, the  m,ire  ~  ** — -  -  -  * 

— z  ri::— “  r by  r 8  *  -* 

a  compact  graph  that  more  dearly  shows  the  fraction  ^  £  T  ^ 

various  bandwidth  levels  For  py  i  t,  applications  operate  at 

idle)  for  54%  of  the  time.  For  ,2%7’tHe  ti  "TT-  ‘h*  ”e''''0rk  ^ 

to  66%)  the  bandwidth  is  between  0  and  1  17/  1  ^  ^  M% 

higher  About  10%  of  ,h  ,  b.ts/cycle/proc,  and  the  rest  of  the  time  1,  is 

gner.  About  10%  of  the  tune  ,t  is  higher  than  4  bits/cycle/proc  The  „,h 

exhibit  less  variance  in  their  bandwidth  profiles.  aPPhCa*i°‘‘S 

loss  that  “  I'  ,0  eaSUy  ViS"1“ZC  “d  'ta  performance 

strained  ^  T  7  “  °”  *  “  *  bandwidth  con- 

bit/cycle/proc  85%  of  the  '  Xamp  C’ '  llle  network  I, as  a  bandwidth  capacity  of  1 

be  unimpeded  ’but  ,h  eXeCU‘i0"  ^  h^*1  ”«b  ^  *h“  «*•  a"«  "d, 

of  four.  ”*  P°r,i0n  Wm  W  SWd  *"">•  “"■>  fit  by  about  a  factor 

limits.  Th':!1;!;!  77777““  ,slowdowm  of  tbe  appbca,ions  under  ^-Mth 
profiles  Bltaat  f  •  -v  aPP  %"g  ‘be  squeeze  performance  model  to  the  bandwidth 

l, or  of  77  ”der  1  bandWMth  °f  1  ■*'**/*«  slow  down  by  a 

Using  this  table,  one  can  choose  an  appropriate  bandwidth  level  such  that  the 
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ds  capacity.  Au, .  '  A  hot  spot  become  a  n  ,  f  g  ®°re  traffic  tj,a 
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ead  across  the  netWorJ{  C*ed  **  well. 
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blkmat 
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1.01 
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Table  6.8:  Slowdown  factors  under  various  bandwidth  limit*. 

* 

tree  sa<ura*ion[PN85]. 

Hot  spots  occur  for  several  reasons.  We  classify  w  *  , 

Which  we  call  locution,  fw,  and  Worn.  boca.ion  hot  spots  0  J 

= “r  z 5hared -y  -  ~ 

-  - — -  j:xrb:r^::~oditaissp-d 

~r  r u,es  are  much  -  ** — 

spreading  addresses  across  the”  "  V'C,°r  COmP“ter5’  an<l  ““  ^  dimi,liSh<id 
dresses  in  our  sZZ  7T  ”  ^  ^  *  *d‘ 

of  hot  spots  which  are  due  simply  to  rlZ  SPO'S’  h"'  ‘hiS  the  third  “‘W 

through  random  chance,  some  memor;:!!::::^/:^  tolZiT 

A  Pessimistic  Performance  Model 

formance  of  the  enL  mlchlTe  ^ZT^olZl  ^e  ““  Tt 
spot  memory.  However  a  short  i; voa  j,  f  1  f  the  hot 

involved  or  possible  b  r  SP°‘  °"ly  Sl°W  d°W"  ‘h°Se  Pro“ssors  directly 

P  y  because  of  synchronization  and  data  dependencies  the  d  1 
propagate  to  other  processors  as  well.  y§  may 

•he  entire  mlr°!eZZePTmiST  °"t,0°k  ln<i  *  ““  ^  S»“*S  S'°»  d°™ 

thus  determine  the  ma  7  ^  ^  “  "temory  module 

mines  the  machine  slowdown  for  that  interval  Th*  *  *  ,  ,  , 

hy  applying  the  squeeze  performance  model  to  the  links  and  tlffi  '  7  T  CaICU,ated 
exiting  the  ho.  spot  memory  module.  d‘reCtly  “d 
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Hot  Memory  Module  Bandwidth  Profiles 


[Plot omitted; x-axis: percentile, 0% to 100%]

Figure 6.8: Sorted profiles of the applications' hot memory module bandwidth usage.

Simulation  Results 

Figure 6.8 shows the simulation results for the hot spot memory module bandwidth. These are typically a factor of 4 to 8 larger than the remote memory bandwidths, and thus hot spots are an important part of the traffic picture.

Table  6.9  shows  the  application  slowdowns  computed  using  our  performance  model. 
Based  on  these  results  we  can  say  that  the  network  needs  a  memory  module  bandwidth  of 


Table 6.9: Slowdown factors based on hot spot memory module bandwidth.
                          Network                  Memory Module
Application  Average BW   BW   over design factor  BW   over design factor
blkmat       0.98         4    4                   16   16
sor          1.24         4    3                   32   26
ugray        1.68         2    1.2                 16   10
barnes       0.59         1    1.7                 8    14

Table  6.10:  Over  design  factors  for  the  network  and  memory  modules. 


16 bits/cycle. We should qualify this by restating that this is a pessimistic model, and that it assumes either rapid onset of tree saturation or propagation of delays to processors not directly involved in the hot spot. We performed additional simulations with a sample interval of 500 cycles in order to gauge how sensitive our results are to this assumption. These simulations, which are probably optimistic, suggest that a memory bandwidth of 8 bits/cycle is adequate. Thus our conclusion is that memory module bandwidth should be from 8 to 16 bits/cycle.

Compared to our results in Section 6.2.1 indicating a remote memory bandwidth of 2 to 4 bits/cycle/proc, the memory module bandwidth is a factor of 4 higher. The direct implication is that networks having higher local bandwidths than bisection bandwidths are advantageous. For instance, the fat tree network in the CM-5[LAD+92] was designed so that the lowest level of the network has four times the bandwidth of the upper levels. Another example is the networks of M. T. Raghunath[RR93] that provide higher local bandwidths as a means of getting high utilization of the bisection bandwidth. A third example is mesh networks that allow adaptive routing of traffic around the hot spot memories.

Another implication of the higher memory module bandwidths is that the memory modules must be over designed so that they have far more capacity than will be needed on average. We can calculate this over design factor by dividing our performance model's bandwidth suggestions by the average bandwidths actually used by the applications. Table 6.10 shows this calculation (for each application individually) for both the network and the memory modules. The network and memory module bandwidths used in this table were taken from Tables 6.8 & 6.9 at levels that allowed achieving slowdowns < 1.10. The network over design factors are moderate and range from 1.2 to 4. The memory module over design factors are much larger and range from 10 to 26.
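
The over design calculation is a simple division, which the following Python sketch reproduces from the values listed in Table 6.10. The dictionary and function names are illustrative only.

```python
# Values as listed in Table 6.10.
average_bw = {"blkmat": 0.98, "sor": 1.24, "ugray": 1.68, "barnes": 0.59}
network_bw = {"blkmat": 4, "sor": 4, "ugray": 2, "barnes": 1}
module_bw = {"blkmat": 16, "sor": 32, "ugray": 16, "barnes": 8}


def over_design_factor(provided_bw, average_used_bw):
    """Ratio of the bandwidth that must be provided to the average actually used."""
    return provided_bw / average_used_bw


factors = {
    app: (
        over_design_factor(network_bw[app], average_bw[app]),
        over_design_factor(module_bw[app], average_bw[app]),
    )
    for app in average_bw
}

for app, (net, mod) in factors.items():
    print(f"{app:8s} network {net:4.1f}   memory module {mod:5.1f}")
```

Rounded, these ratios agree with the over design factor columns of the table.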
Such large over design factors are required to service the hot spots in the memory access patterns. These hot spots arise because of the inevitable non-uniformity of the random message distribution. An analogous problem is the random distribution of n balls into n buckets. On average each bucket will receive 1 ball, but the worst case bucket will receive $\Omega(\log n / \log \log n)$ balls.
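
The balls and buckets analogy is easy to try out. The following Python sketch throws n balls into n buckets uniformly at random and reports the fullest bucket. It is only an illustration of the analogy; the function name and the fixed seed are assumptions made for repeatability, and it is not part of the simulator.

```python
import random
from collections import Counter


def max_bucket_load(n, seed=0):
    """Throw n balls into n buckets at random and return the largest bucket count."""
    rng = random.Random(seed)
    loads = Counter(rng.randrange(n) for _ in range(n))
    return max(loads.values())


# The average load is exactly 1 ball per bucket, but the fullest bucket
# holds noticeably more, and its load grows slowly as n increases.
for n in (256, 4096, 65536):
    print(n, max_bucket_load(n))
```

The gap between the average load and the maximum load is the same effect that forces the memory modules to be over designed.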

6.2.3  Location  Hot  Spots 

The elimination and reduction of location hot spots has been the subject of a large body of research[DK92, GGK+82, MCS91, PN85, Ran89, YTL86]. These techniques involve either hardware combining of messages or software combining trees. Hardware support is typically for fetch-and-add operations, from which many highly parallel synchronization techniques can be built[GGK+82]. Software techniques have been devised for barriers and synchronous reductions[MCS91, YTL86].

Despite the large amount of research on combining techniques, there has been little previous work done on measuring how beneficial combining would be for real applications. This is partly a chicken and egg problem because without hardware support, programmers have little incentive to use fetch-and-add like operations. It has been suggested, however, that fetch-and-add is useful even if not combined[MCS91] because it is a simple atomic operation that can be performed quickly at the memories. We have provided such a non-combining fetch-and-add operation in our simulation system, but we have used it only a few times.

In general, ordinary memory requests, such as several reads to a single location, can also be combined. In this section we determine an upper bound on the benefits of hardware combining. Our simulations measure (indirectly) the total number of accesses to each individual memory location during a sample interval. We then use these numbers as our upper bound on combining. In other words, we assume that all references to a single location during a sample interval can be combined. This is optimistic for two reasons. First, combining of different reference types (such as a read, a fetch-and-add, and a write) is very complex and unlikely to ever be implemented. And second, our sample interval of 100 cycles is long enough that in a real network messages will often pass through the routing nodes before their potential combining partners arrive.
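
As a rough illustration of this upper bound, the following Python sketch takes the addresses referenced during one sample interval and reports how many references hit the hottest location, and how many messages would remain if every reference to the same location were combined. The function names and the input format are hypothetical; the simulator gathers this information indirectly.

```python
from collections import Counter


def hottest_location_accesses(addresses):
    """Number of references to the most heavily used location in one interval."""
    if not addresses:
        return 0
    return max(Counter(addresses).values())


def messages_after_ideal_combining(addresses):
    """Messages left if all references to one location combine into one message."""
    return len(set(addresses))


# Hypothetical interval: three references to 0x10 and one to 0x20.
interval = [0x10, 0x20, 0x10, 0x10]
print(hottest_location_accesses(interval), messages_after_ideal_combining(interval))
```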

Figure 6.9 shows the amount of traffic at the hottest (most heavily used) location
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Hot  Location  Bandwidth  Profiles 


percentile 


Figure  6.9:  Sorted  profiles  of  the  applications’  hot  location  bandwidth  usage. 


during  each  sample  interval.  As  before,  the  samples  are  sorted  from  lowest  to  highest. 
To  interpret  this  data  we  need  to  know  the  correspondence  between  bandwidth  and  the 
actual  number  of  memory  operations  that  occurred.  Figure  6.3  showed  our  assumptions  for 
message  sizes.  The  messages  needed  to  service  a  simple  read  miss  constitute  a  total  of  224 
bits  (7  words)  of  traffic.  A  read  miss  is  the  most  common  operation  in  the  network  and 
we  will  use  it  as  our  basis  for  calculation.  When  normalized  to  bits/cycle  over  a  100  cycle 
sample  interval,  as  we  have  used,  a  single  read  miss  uses  a  bandwidth  of  2.24  bits/cycle. 

Using  ugray  as  an  example,  the  lowest  16%  of  the  intervals  show  a  hot  location 
bandwidth  of  2.24  bits/cycle,  which  is  equivalent  to  the  traffic  from  one  read  miss  message. 
This  means  that  no  location  was  referenced  more  than  once  during  these  intervals.  Next 
there  are  some  sample  intervals  that  have  slightly  higher  bandwidths.  These  most  likely 
represent  a  single  access  that  caused  some  invalidation  traffic.  After  these,  the  rest  of  the 
intervals  all  the  way  up  to  the  94th  percentile  show  a  bandwidth  of  4.48  bits/cycle,  which 
equals  2  messages.  These  two  messages  might  be  combined,  but  such  limited  combining  has 
little  benefit  and  is  not  the  motivation  for  combining  hardware. 
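
The conversion between hot location bandwidth and message counts used above can be written out explicitly. This Python sketch assumes, as in the text, 224 bits of traffic per simple read miss and a 100 cycle sample interval; the function names are illustrative.

```python
READ_MISS_BITS = 224       # 7 words of traffic per simple read miss
SAMPLE_INTERVAL = 100      # cycles per sample interval


def read_miss_bandwidth(num_misses):
    """Bandwidth in bits/cycle produced by num_misses read misses in one interval."""
    return num_misses * READ_MISS_BITS / SAMPLE_INTERVAL


def equivalent_read_misses(bandwidth):
    """Number of read misses that would produce the given bits/cycle in one interval."""
    return bandwidth * SAMPLE_INTERVAL / READ_MISS_BITS


print(read_miss_bandwidth(1))        # one reference to the hottest location
print(read_miss_bandwidth(2))        # two references that could be combined
print(equivalent_read_misses(4.48))
```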
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f:  S°r'  U*  °!  ^  intervals  for  „gray,  ^  d“™^  2%  of  the  inters 

ho.  spots  are  primarily  due  '  °|  ‘he  “‘"rvais  for  barnes.  For  sor 

eliminated with a special purpose barrier network as on the CM-5[LAD+92] or with software barriers[MCS91, YTL86]. Since serious location hot spots are so infrequent, we do not believe that hardware combining is justified.

We  should  note  that  two  of  tha  v 

*>r  showed  an  unexpected  7T‘  ^  **•  run  of 

resetting  a  shared  Hag  when  only  one  processor  neededVje,  ^  *” 

for  ugray,  there  was  a  problem  with  t  •  '  *US  Was  ea5lly  amended. 

^  list.  This  lock  was  a  bottleneck,  b„,  had  ^  ^  ^  ^  • 

"ever  previously  been  run  with  so  many  proce  s  I  tII  tT  ^  ^  ^ 

a  parallel  free  list.  SOf5-  The  boWeueck  was  eliminated  by  using 


6.3 Summary and Implications


r  most  applications,  caching  is  effective  at  a  • 
requirement.  Thus  despite  the  cost  and  comnlet  * 

is  hkely  to  be  cos.  effective  because  of  the  reducf  y  ”  Pr°V,d“g  COllerei“  CaCteS’  CacU"S 
skinnier  network  than  on  systems  without  caching0”  “  “*"*  “■*  ***“  *  ^wing  a 
When  burstiness  is  taken  into 

model  have  shown  that  the  networks  for  largeTaLT  SimUla“°”  ^  ^  Perfo™ance 
should  provide  a  remote  memory  bandwidth  of  f  T"^  ^  mul‘iprocessors 

module  bandwidth  o,  from  8  J  I#  ^ £ *  *°  4  “‘«Proc  and  a  memory 

-ory  modu.es  is  needed  to  accomnjje  r an^  * 

fore  should  be  ^  ^ 

Processors.  Specifically,  our  results  are  based  on  th  ^  difere"“s 

Processor  operating  at  one  operation  p^  ^  R3°°0[Kaa89)'  *  “ 

operations  per  cycle,  for  example,  would  thus  need  .  Pr°CeSSOr  "P'rating  a.  two 

thus  need  to  support  twice  the  network  bandwidth. 
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eters.  The  link  widths  are  based  on  tOPOl°SieS  ^  S°me  Param- 

~r-~  s° ,hat  the  ~  JL'Zrr iy  ms  ™  ^ 

a  proc/cyde  or  a  memory  module  bandwidth  of  16  b't  /"T"  ba“‘iwi,itl>  of 

widths  was  then  selected  and  used  for  caiculati  '  S/c-vd'’-  The  larger  of  these  tWQ 
estimate  of  network  cost.  The  key  observation  to  b  ?  ^  Whid“  “  a  ™rgh 

the  butterfly  pr„vide  more  locaJ  ba  6  “  tba‘  of  the  networks  except 

iocal  bandwidth  is  beneflciai  because  it  provides  thTh  h^b”  “d  “  «*» 

spot  memory  modules.  gher  bandwidth  needed  by  the  hot 


bidi,^ 
primitives. For example, a lock on the Sequent Symmetry is just a memory location and its value[GT90]. If the value is 0, the lock is free; if it is 1, the lock is taken. To obtain the lock a processor executes a swap(addr, 1) instruction. This instruction atomically reads the old value at the address and writes the new value of 1. If the swap instruction returns an old value of 0, the lock has been obtained, but if it returns the value 1, then the lock is already taken. In this case, the processor spins, continuously reading the memory location until it changes to 0, at which point the processor retries the swap instruction.
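
The swap based spin lock just described can be sketched in Python as follows. This is a minimal illustration, not the Sequent implementation: the atomic swap instruction is emulated with a mutex, and the class and function names are hypothetical.

```python
import threading
import time


class SwapCell:
    """One shared memory word with an atomic swap, emulated here with a mutex."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()

    def swap(self, new_value):
        # Atomically read the old value and write the new one.
        with self._guard:
            old_value, self._value = self._value, new_value
            return old_value

    def read(self):
        return self._value

    def write(self, new_value):
        with self._guard:
            self._value = new_value


class SpinLock:
    """Lock word is 0 when free and 1 when taken."""

    def __init__(self):
        self.cell = SwapCell()

    def acquire(self):
        # swap returns the old value: 0 means we obtained the lock.
        while self.cell.swap(1) != 0:
            # Lock already taken: spin on plain reads until it looks free,
            # then retry the swap.
            while self.cell.read() != 0:
                time.sleep(0)  # yield so the lock holder can make progress

    def release(self):
        self.cell.write(0)


def run_demo(num_threads=4, increments=100):
    """Have several threads increment a shared counter under the spin lock."""
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            total[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

The inner read loop corresponds to spinning on the cached copy of the lock word; only the swap attempts would generate bus traffic on a real machine.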

When  the  lock  is  free,  this  is  very  efficient  since  there  is  just  a  single  memory 

t  ‘OCk:  “d  *-  *  **  -*•<-*..>  instruction  to 
lock. The problem arises when the lock is simultaneously desired by several processors. One processor will obtain the lock; the others will incessantly read the lock location waiting for its release. These reads would saturate the bus to shared memory except that most are filtered out by the caches. After the first read, the value is cached and subsequent reads simply spin on the value in the cache. When eventually the processor

“ by  ^ a  ° ,o  ,he  ,ock 

lock  ’  *“  *he  CaCheS'  At  ,hiS  POi"*  *he  remaining  contenders  all  race  to  obtain  the 

accesses  Bvl"  7“  ^  °f  0(»)  bus 

By the time a group of n processors all get their turn with the lock there will have been O(n^2) bus accesses to the lock location. There are a number of more sophisticated lock implementations which try to reduce this traffic[And90, GT90, MCS91]. The best of these reduce bus traffic to O(n) by building a software queue in which the waiting processors

Zell::  :r flags- Measins  a  iock  tavoives  ^  -  **  «. - nex. 

—  .rr  r  i,:*;;::;:*'  "•  — -  --  *-  - 


Fetch-And-Add


The NYU Ultracomputer project[GGK+82] proposed an innovative synchronization instruction: fetch-and-add (f&add). F&add(addr, value) reads the specified location and returns the contents as the operation's result, and then adds the specified value and stores the sum back into the addressed location.
A simple use of f&add is to have each of n processors execute f&add(X, 1). If initially X = 0, the values returned will be distinct and go from 0 to n - 1. These values might then be used to select unique tasks for each processor in a parallel computation.
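
The use of f&add to hand out unique indices can be sketched as follows. This Python fragment only illustrates the semantics (read the old contents, add the value, store the sum, return the old contents); the atomicity is emulated with a mutex, there is no combining, and all names are hypothetical.

```python
import threading


class FetchAndAddCell:
    """A memory location supporting an atomic fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def fetch_and_add(self, delta):
        # Return the old contents and store old + delta back, atomically.
        with self._guard:
            old = self._value
            self._value = old + delta
            return old


def hand_out_indices(n):
    """Have n threads each execute fetch_and_add(1) on a cell that starts at 0."""
    cell = FetchAndAddCell(0)
    results = [None] * n

    def worker(slot):
        results[slot] = cell.fetch_and_add(1)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


indices = hand_out_indices(8)
```

Whatever order the threads run in, each receives a different value, so the results can be used directly as task numbers.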

The power of f&add comes from the fact that multiple f&add messages can be combined in a tree like fashion as they proceed through a butterfly network. If combining works well, a group of simultaneous f&add's will get fully combined so that only a single message actually reaches the memory module. The result message returned from the memory module will then be split apart and the correct return values computed as the message(s) return back down the combining tree. This can be done in such a way that the responses are the same as if the f&add's had been performed sequentially. Combining

rrrrr  "  the  dUe  *°  t0*  *°  b*  <*■>«*  our  result! 

in Section 6.2.3 suggest that such congestion is rare in our applications).

Gottlieb[GGK+82] shows that data structures such as parallel queues can be designed using f&add and combining so that there is no serial bottleneck. In other words, hundreds of processors can simultaneously insert and remove entries from the queue without ever entering a critical section (where only a single processor has exclusive access to the internal queue data structures).

Although f&add based synchronization routines (along with combining) can eliminate memory hot spots (if they were a problem), these routines still use spinning in order to wait for a synchronizing event, such as the release of a lock or the insertion of an entry in a queue. Thus from the point of view of a multithreaded processor, cycles would still be wasted by threads waiting on synchronization operations.


Full/Empty  Bits 

Synchronization on the HEP[Kow85] was done through full/empty bits associated with each location. For example, there was a write instruction that would set the full bit when it wrote a value, and there was a read instruction that would check that the full bit was set before reading the value. If the location was empty, the reading thread would wait until an appropriate write occurred. These full/empty bits allowed very fast and fine-grained synchronization.

Since the HEP was a multithreaded machine, spinning the processor while waiting would be wasteful. Instead there was a separate unit called the Storage Function Unit that
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did  the  spinning. 


7.1.2  Non-Spinning  Synchronization 

All  of  the  spin  based  synchronization  techniques  reviewed  in  the  „  • 

r t:,y;  :fr memory 

-  w  w  “  rrrr:  ;::r -  thra  w* *  - 

ready  a  ““  a  loch  is 

i-tigate  the  notation  of  a  waiting  thread.  ^ 

ndamentally,  synchronization  mechanisms  such  as  locks  a.nd  h 
a  message  from  a  thread  to  ,  .  cis  and  Carriers  involve 

8  a  thread  to  a  synchronization  agentfSync-Aeenn  and  *u 
message  from  the  Sync-Ae-ent  to  7  S  d  then  a  subseq«ent 

immediately  sends  a  renlv  me  1§  aVajlabIe’ tbe  Sync- Agent 

ule,  if  Jo«h ;;  yf;  rr  s;;rg  ,he  iock  ,o  tte  *kread  *•  *»*  as 

request  for  the  lock  then  arrives  while  the  lock  is  taken  ♦»,  c 
qUMeS  ‘US  "*«.  “* *«  when  the  loch  is  eventually  released  ITsl'n  A 

t°:“^  •»  ^zz:i*r - 

*  ey  were  designed  w„h„„,  consideration  of  mu!, i, breading  and  thus  use 

«  ~s  -  - 

implementing  it  as  such  In  fact  1  1  >  message  passing  M(i  therefore  directly 

erences  are  also  messages  as  wed  ‘  memory  ref. 

-  *  »*  message  *B‘  *  -  *  -e, 

mented  with  niessages,  our  tnessage  based  synchroniaaj  wZ"^  ^ 

implementation  tha,  hts  naturady  into  the  design  of  onr  mnltithreaded  ^ 1 
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Messages 

What was described above as the synchronization agent is really just the interface to a memory module. For normal memory operations the interface receives messages, does the memory operation, and sends replies. It also must send and receive any messages needed to maintain cache coherency. For synchronization operations, the memory interface sends the messages dictated by those operations.

A synchronization variable is a memory location just like any other variable; the only difference is that it is accessed via synchronization instructions rather than normal memory access instructions.

We  have  provided  the  following  synchronization  operations  in  our  simulator: 

lock: A message is sent to request a lock, and when the lock becomes available, a message is returned granting the lock. An unlock message is used to release a lock.

barrier: Each thread sends a message when it is ready to check in at a barrier, and when all threads have checked in, barrier completion messages are returned to each of the threads¹.

fetch-and-add: This is the same as for the NYU Ultracomputer except that there is no combining. An f&add message is sent to the memory, the addition is performed there, and the reply message returns the fetched value.

wait: This is similar to the full/empty bits on the HEP except that it is just a synchronization: there is no associated data transfer. A message is sent to wait for a specified flag to be set. If the flag is already set, a completion message is immediately returned; otherwise the response occurs later when the flag is set. There are also messages for setting and resetting the flag.
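
To make the message flow concrete, the following Python sketch models how a memory interface might handle barrier and wait messages without any spinning. It is an event-driven illustration under assumed message names ("barrier_done", "wait_done"); the actual message formats are those of the simulator and are not reproduced here.

```python
class SyncInterface:
    """Toy memory interface that answers barrier and wait messages.

    Each handler returns the list of reply messages to send, as
    (message_name, thread_id) pairs. A thread that receives no reply
    simply stays descheduled until one arrives later.
    """

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.checked_in = []   # threads that have arrived at the barrier
        self.flags = {}        # flag address -> True if set
        self.waiters = {}      # flag address -> threads waiting on it

    def barrier(self, thread_id):
        self.checked_in.append(thread_id)
        if len(self.checked_in) < self.num_threads:
            return []          # not everyone has arrived yet
        arrived, self.checked_in = self.checked_in, []
        return [("barrier_done", t) for t in arrived]

    def wait(self, thread_id, flag):
        if self.flags.get(flag, False):
            return [("wait_done", thread_id)]   # flag already set
        self.waiters.setdefault(flag, []).append(thread_id)
        return []

    def set_flag(self, flag):
        self.flags[flag] = True
        woken = self.waiters.pop(flag, [])
        return [("wait_done", t) for t in woken]

    def reset_flag(self, flag):
        self.flags[flag] = False
        return []


iface = SyncInterface(num_threads=3)
replies = [iface.barrier(0), iface.barrier(1), iface.barrier(2)]
```

Only the last thread to check in triggers any replies, at which point all three threads receive their barrier completion messages.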

The message formats and their sizes are shown in Figure 7.1. These messages are the same format as the messages used for regular memory operations, except that no data is associated with most of the synchronization operations. They are therefore compact messages; most are just one or two words long.

¹ In our general simulations we have not modeled the limited network bandwidth and the serialization
at  the  memory  module  of  the  many  barrier  messages.  In  Section  6.2  we  addressed  the  network  bandwidth 
limits  and  found  that  barriers  are  infrequent  enough  in  our  applications  that  the  occasional  congestion  they 
cause  has  only  a  minor  performance  impact.  Later  in  this  section  we  will  mention  some  congestion  free 
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Figure  7.1:  Messages  for  synchronization  operations. 
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Implementation 

Our message based synchronization fits nicely with the design of a multithreaded processor. When a thread issues a normal memory access into the network, that thread is then context switched out and it waits for the response message. The exact same thing can occur for synchronization. For example, a thread requests a lock by sending a message into the network, and then it waits for the reply message indicating that the lock has been granted. The only difference is that memory references have fairly uniform latencies (a few hundred cycles), while synchronization references have variable latencies (depending upon how many other threads are also waiting for the lock and how long they each hold it).

Because  issuing  a  memory  reference  or  a  synchronisation  reference  hoth 
descheduling  of  a  thread  and  fl,  •  ,  reierence  both  cause 

may  have  long  latencies,  strictly  round  robin  schedu^g  7^27x2' 
Figure 7.2: Operation of the waiting queue for a lock.
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in  which  it  can  build  waiting  queues  M  li„ked  Usts.  This  array  to„tains  on(j  fcr 

each thread, but only one such array will be needed per memory module, regardless of the number of locks.

In the example in Figure 7.2, 4 is a lock variable which is initially set to free. Thread 6 is the first to request the lock and is therefore immediately granted the lock. Threads 5 and 3 then request the lock and are queued up in FIFO order. Next thread 6 releases the lock, which causes the lock to be granted to the thread at the head of the waiting list: thread 5. When thread 5 releases the lock, the lock is granted to thread 3. And finally when thread 3 releases the lock, the lock returns to the free state.

Since the three word lock and the waiting queue array all reside on the same memory module, they can easily be updated atomically. The memory interface unit simply performs all its operations for the lock variable before servicing the next incoming message.

If an application has multiple lock variables, they should be spread across the
memory modules to avoid unnecessary hot spots. Since the number of lock variables is
unlimited, there may be several lock variables on each memory module. These can all share
a single waiting queue array since the threads in each waiting queue will be distinct. This
is because it is only possible for a thread to be waiting on one lock at a time, and thus
it could never be on more than one queue. It is perfectly valid, however, for a thread to
obtain nested locks. They just must be obtained one at a time.
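
The queueing behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the memory interface design itself: the class name, the method names, and the choice of a held flag plus head and tail indices for the three word lock variable are assumptions made for the example. It shows how one per-thread array can serve as the linked-list storage for every lock on a memory module.

```python
class MemoryModule:
    """Illustrative model of one memory module that services lock messages."""

    def __init__(self, num_threads):
        # Shared waiting queue array: one entry per thread, holding the id of
        # the thread queued behind it (None if it is last or not queued).
        self.next_waiter = [None] * num_threads
        self.locks = {}

    def add_lock(self, name):
        # Assumed three word layout: held flag, queue head, queue tail.
        self.locks[name] = {"held": False, "head": None, "tail": None}

    def request(self, name, thread):
        """Process a lock request; return the thread granted the lock, or None."""
        lock = self.locks[name]
        if not lock["held"]:
            lock["held"] = True
            return thread  # granted immediately
        # Lock is busy: append the thread to this lock's FIFO queue.
        self.next_waiter[thread] = None
        if lock["head"] is None:
            lock["head"] = thread
        else:
            self.next_waiter[lock["tail"]] = thread
        lock["tail"] = thread
        return None

    def release(self, name):
        """Process a release; return the next thread granted the lock, or None."""
        lock = self.locks[name]
        head = lock["head"]
        if head is None:
            lock["held"] = False  # nobody waiting: lock returns to free
            return None
        lock["head"] = self.next_waiter[head]
        if lock["head"] is None:
            lock["tail"] = None
        return head  # lock passes directly to the head of the queue
```

Because each call runs to completion before the next message is handled, the update of the lock variable and the queue array is atomic in this model, mirroring the argument made in the text. Replaying the sequence of Figure 7.2 (requests from threads 6, 5, and 3, followed by three releases) grants the lock to 6, then 5, then 3, and finally frees it.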

Because of the complexity of synchronization (and cache coherency) it is likely
that the memory interface will be some sort of programmable device. In fact, each memory
module will likely be connected to one of the processors, and the synchronization and cache
coherency might be handled by a quick interrupt to the processor. The Wisconsin Wind
Tunnel[HLRW92] uses a CM-5 in this fashion; the processor manages the cache coherency
protocol. We suggest letting the processor handle synchronization operations in the spirit
of active messages[vECGS92]. This allows the synchronization operations to be changed
and supplemented rather than being designed into the machine.

Besides locks, other synchronization operations such as barriers, waits, and queues
can also be built with messages so that no spinning is required. In addition to eliminating
spinning, performing complex synchronization operations at the memory modules has
the advantage of being faster than building them out of several simpler synchronization
operations that each involve a network traversal delay[BR90].

If contention is a problem, many synchronization operations can be implemented in
a distributed fashion. Barriers can be implemented with software combining trees[YTL86]
or a potentially faster technique called the dissemination barrier[MCS91]. Index distribution
and work queues can be implemented using the low contention techniques of Herlihy[HLS92].
These are based on counting networks and do not use spinning if there is an atomic memory
operation such as fetch-and-add.


7.2  Line  Size  for  Minimizing  Bandwidth 


In this section we study the effect of cache line sizes on the network bandwidth
needs of our applications. A large cache line has the potential of decreasing network
bandwidth because the headers for routing and specifying a memory access are of fixed size
(see Figure 6.3), but the data payload varies with the cache line size. With a larger line, the
fraction of bandwidth used for data is higher.
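
The header amortization argument can be made concrete with a small calculation. The sketch below is illustrative only: the 8 byte header is a hypothetical value chosen for the example (the real message formats are given in Figure 6.3), and it considers a single message that carries one cache line.

```python
def data_fraction(line_bytes, header_bytes):
    """Fraction of a data-carrying message that is payload rather than header."""
    return line_bytes / (line_bytes + header_bytes)


HEADER_BYTES = 8  # hypothetical header size, for illustration only

for line_bytes in (8, 16, 32, 64, 128):
    share = data_fraction(line_bytes, HEADER_BYTES)
    print(f"{line_bytes:3d} byte line: {share:.0%} of the message is data")
```

Under this assumption the data share rises steadily with line size, which is why larger lines look attractive before the effects of unused data, false sharing, and cache pollution are taken into account.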

In practice, a larger line size might actually increase the bandwidth requirement
for several reasons. First, the requesting processor may not use all of the locations
in a cache line. These unused locations use bandwidth when they are brought across the
network, but do not otherwise affect performance. Second, a large line size increases the
likelihood of false sharing. This is the case where two different processors access different
parts of a single cache line, and the cache line is ping-ponged back and forth between the
processors even though they are not actually sharing any variables. Third, larger line sizes
imply fewer total lines in the cache and thus increase the probability that useful data will
get replaced with unwanted data (cache pollution).

Figure 7.3 shows the bandwidth usage of the applications when run with cache line
sizes ranging from 8 to 128 bytes. The experimental parameters are shown in Table 7.2.
These experiments did not use multithreading. However, we expect that the same relative
relationship between bandwidth and line size will continue to hold for multithreaded
systems.


Most  of  the  applications  have  increasing  bandwidth  needs  with  larger  line  sizes. 
The  best  choices  for  minimizing  bandwidth  usage  are  either  8,  16,  or  32  byte  cache  lines 
depending  upon  the  application.  A  16  byte  line  size  is  the  best  overall  choice,  and  that  is 
the  value  we  have  used  throughout  the  studies  in  this  thesis. 

O'Krafka[O'K92] also looked at traffic as a function of cache line size. He studied [...]

Figure 7.3: Bandwidth as a function of line size.

Table 7.2: Experimental parameters for measurements of bandwidth [...] line size.

Figure 7.4: Miss rates as a function of line size.

Figure 7.4 shows the miss rates as a function of line size. These [...] be expected to
decrease with larger line size and do [...] (blkmat, mp3d, and locus). The diminishing
rates of decrease [...] increasing miss rates [...] utilizations as a function of line
size and confirms this inverse relationship between miss rates and performance. However,
since miss rates do not diminish with increased line size for all of the applications,
performance increases for some applications but declines for [...]

Figure 7.5: Processor utilization as a function of line size.

The goal of this section is to determine an appropriate line size for minimizing the
network bandwidth requirement. Simply looking at the bandwidth as a function of line size
can be misleading because bandwidth usage might be low simply because the processors were
idle most of the time. Figure 7.6 adjusts for this possibility by normalizing the bandwidth
results from Figure 7.3 by the processor utilization results from Figure 7.5. (This sort of
normalization actually occurs when multithreading is used to increase performance to a
higher level.) The earlier conclusion that a 16 byte line size is the best overall choice
for minimizing bandwidth still holds, although several applications now have slightly lower
bandwidth requirements with a 32 byte line size.
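
One plausible reading of this normalization is a simple division of measured bandwidth by processor utilization, which estimates the bandwidth the application would demand if the processor were kept fully busy. The sketch below assumes that reading; the function name and the sample values are hypothetical and are not taken from Figures 7.3 through 7.6.

```python
def normalized_bandwidth(bandwidth, utilization):
    """Scale measured bandwidth up to what a fully busy processor would need."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in the range (0, 1]")
    return bandwidth / utilization


# Hypothetical example: a line size that looks cheap only because the
# processor was idle half of the time.
measured = 2.0      # arbitrary bandwidth units
utilization = 0.5   # processor busy 50% of the time
print(normalized_bandwidth(measured, utilization))
```

A configuration with low raw bandwidth but low utilization can therefore end up with a higher normalized requirement than one that moves more data while keeping the processor busy.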

Network bandwidth is not the sole issue in choosing a line size. Another major
factor is the overhead in terms of cache tag space and directory storage that is needed to
locate lines and manage coherency. The number of lines is directly proportional to the
amount of this storage, and thus doubling the line size halves the amount of tag and
directory storage. Because of the large amount of directory storage, this may be a more
important factor than bandwidth. The KSR1[Ken92], for example, has chosen a 128 byte
line size in order to limit the size of its distributed directories.

7.3 Cache Degradation due to Multithreading

[...] behavior [...] used by each other[ALKK90, CGL92, SBCvE90 ...]. The interaction
between threads as they access the cache can be destructive [...]

[...] one for local data, and one for [...] instructions [...] we have only considered
the shared data cache. We expect that the [...] should [...] constructive interaction
between the threads since they are [...] they execute in the same [...] of code
simultaneously. For local data [...] will be destructive, and caches should clearly be
[...] to accommodate the multiple threads. Figure 7.7 shows the [...] the shared data
cache as a function of multithreading (see footnote 4).

4 The simulation parameters are the same as in Table 5.6.

Figure 7.7: Cache miss rates as a function of multithreading.


Generally the miss rates increase slightly with multithreading. Saavedra-Barrera
[SBCvE90] modeled this increase in miss rates based on the assumption that with
multithreading of N, the cache would behave as if it were partitioned into N subsets, each
of size 1/Nth of the original cache. Then based on an analytic model of cache behavior, he
derived an expression for the miss ratio as a function of multithreading:
m(N) = m(1) N^K, where m(N) is the miss ratio with multithreading N, and K is a constant
that depends on the behavior of the applications. This model does fit the behavior of many
of the applications, but the values of K vary widely.
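
As a small worked illustration of this power-law model, the sketch below evaluates m(N) and recovers K from a pair of miss ratios. The function names and the sample numbers are hypothetical and are not measurements from this study; the exponent form is the reading of the model used here.

```python
import math


def miss_ratio(m1, n, k):
    """Predicted miss ratio at multithreading level n, given m(1) and exponent K."""
    return m1 * n ** k


def fit_k(m1, mn, n):
    """Solve m(N) = m(1) * N^K for K from one observation at level n > 1."""
    if n <= 1:
        raise ValueError("need a multithreading level greater than 1 to fit K")
    return math.log(mn / m1) / math.log(n)


# Hypothetical example: a 2% single-threaded miss ratio with K = 0.5.
m1 = 0.02
for n in (1, 2, 4, 8):
    print(n, miss_ratio(m1, n, 0.5))
```

A value of K near zero corresponds to threads that barely disturb each other in the cache, while larger values indicate stronger destructive interference. Fitting K separately per application is one way to see how widely it varies.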

A major factor in this variation in cache behavior is whether the threads interact
constructively or destructively. For some applications this depends on the order in which
threads are assigned to processors. For example, the sor application works on a two
dimensional array that is partitioned into rectangular blocks with one block per thread (see
Section 2.2.3). Interaction between threads occurs for the data along the edges of these
blocks, and if abutting blocks (threads) are assigned to the same processor, the common
edges will interact constructively in the cache.

Figure 7.8: Assignment of threads to processors. (The figure shows threads 0 through 15
assigned to processors P1 through P4 in interleaved and in blocked order.)

Figure 7.8 shows two possible orders for assigning threads to processors. The
blocked order is better for sor, barnes, and blkmat because they partition the problem by
thread id numbers, and thus threads working on neighboring regions are assigned to the
same processor. Blocked ordering is actually worse for water because the particles are not
isotropically distributed, and thus blocked assignment aggravates load imbalance on the
processors. For the other applications, work scheduling is less structured and the thread
ordering is unimportant. For the simulations in this thesis, we used blocked assignment for
all of the applications except for water, for which we used interleaved assignment.
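
The two orders can be sketched as follows. Since the contents of Figure 7.8 are not reproduced here, the exact mapping is an assumption: interleaved is taken to mean that thread t goes to processor t mod P, and blocked is taken to mean that consecutive thread ids share a processor. The helper that counts neighboring thread ids on the same processor is likewise only an illustration of why blocked order can help applications that partition by thread id.

```python
def interleaved(num_threads, num_procs):
    """Assumed interleaved order: thread t runs on processor t mod P."""
    return [t % num_procs for t in range(num_threads)]


def blocked(num_threads, num_procs):
    """Assumed blocked order: consecutive thread ids share a processor."""
    if num_threads % num_procs != 0:
        raise ValueError("sketch assumes threads divide evenly among processors")
    per_proc = num_threads // num_procs
    return [t // per_proc for t in range(num_threads)]


def colocated_neighbors(assignment):
    """Count pairs of consecutive thread ids placed on the same processor."""
    return sum(1 for a, b in zip(assignment, assignment[1:]) if a == b)


print(interleaved(16, 4))
print(blocked(16, 4))
print(colocated_neighbors(interleaved(16, 4)), colocated_neighbors(blocked(16, 4)))
```

With 16 threads on 4 processors, blocked order keeps most consecutive thread ids together while interleaved order separates all of them, which matches the intuition that abutting blocks benefit from sharing a cache.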

7.4  Longer  Latencies 

The  final  issue  which  we  address  in  this  chapter  is  what  happens  when  latencies 
are  longer  than  200  cycles.  We  should  expect  that  without  multithreading  the  increased 
latencies  will  mean  longer  waits  for  remote  references  and  thus  lower  efficiencies.  With 
multithreading,  we  should  expect  that  more  threads  will  be  needed  to  hide  the  longer 
latencies. 

Figures 7.9 & 7.10 show simulation results for the applications at latencies of
200, 500, and 1000 cycles (see footnote 5). As expected, single-threaded efficiencies
drop with increased [...]

5 The simulation parameters are the same as in Table 5.6 except that the latency has been
varied and higher multithreading levels were used for some applications.

[...] for longer periods of time, increasing lock contention.


Finer Granularity: As more threads are demanded from a fixed sized problem, the problem
must be divided into smaller pieces of work. This finer partitioning leads to more
communication and thus shorter run-lengths.

There are also several ways to ameliorate the impact of longer latencies:

Larger Problem Sizes: When the applications are run with larger problem sizes, the
granularity of tasks can be increased and/or more threads will be available to hide [...]

Larger Caches: Larger caches can decrease the miss rates, which both increases average
run-lengths and decreases the number of points at which stalling might occur.

Better Load Balancing: More [...] balancing of threads [...] the duration of the slow
completion process of the last thread.

Despite the lower efficiencies achieved with multithreading under very long latencies,
multithreading provides larger net performance gains. The typical application (such
as sor) has a 100% performance improvement (from 30% to 60% efficiency) under
multithreading with a 1000 cycle latency, whereas with a 200 cycle latency, typical
performance improvements were 33% (from 60% efficiency to 80% efficiency).

Chapter 8

Hardware  Support 


Most previous research into multithreading machines has involved complex hardware
to support the switch-every-cycle model[ACC+90, HF88, Kow85, PC90]. These machines
were built to switch every cycle so as to allow a fast pipeline implementation without
concern for data dependencies. In addition, this multithreading mechanism is used
to tolerate long unpredictable memory latencies. These machines typically also provide
sophisticated synchronization (using full/empty bits on memory), and support powerful
programming models allowing rapid and dynamic creation of threads.

Unfortunately, every hardware capability has its cost. TERA[ACC+90], for instance,
allows fast dynamic creation of threads, and it provides 128 banks of 32 registers to
hold these threads' register values. This huge amount of hardware complicates the machine
because it slows down the access time to the register file. Alternatively, Monsoon[PC90]
limits the size of the register file by severely restricting the number and lifetimes of
registers, which undoubtedly has a negative performance impact. It remains unclear if these
machines offer advantages in either performance or ease of programming compared to the
simple multithreaded shared memory model studied in this dissertation.

We have studied multithreading models which we feel have a balance between
computational flexibility and implementation simplicity. Only a small number of threads are
allowed per processor and thus the register file can be kept reasonably small. The
programming model involves a static set of threads that is used for the lifetime of the
program and thus support for fast dynamic thread creation is not needed. The synchronization
mechanisms are simple and similar to remote memory references. Finally, thread scheduling
uses simple policies and only switches threads at special events. In this chapter we will
present [...]

[...] grouping (refer to Table 4.4) except that the context switch time has been increased. The
performance loss varies by application. The applications that have long average run-lengths
(such as sieve, blkmat, water, and barnes) lose just a few percent of their performance.
The other applications (sor, ugray, locus, and mp3d) have shorter run-lengths, context
switch more frequently, and thus incur a larger performance loss from the slow context
switch. The worst performance loss is 16% which occurs for sor and mp3d.

The  performance  losses  are  lower  than  we  expected  for  two  reasons.  First,  average 
run-lengths  for  the  explicit-switch  model  with  inter-block  grouping  (as  seen  in  Section  4.3.2) 
range  from  30  cycles  to  more  than  200  cycles.  With  longer  run-lengths,  the  context  switch 
time  has  less  impact  on  performance.  Second,  some  of  the  cycles  lost  to  the  slower  context 
switch  would  have  otherwise  been  lost  to  memory  latency.  This  is  a  small  effect,  since 
multithreading  is  usually  effective  at  hiding  most  of  the  memory  latency. 

This  section  has  shown  that  pipelining  the  context  switch  is  feasible  and  can 
provide  as  much  as  16%  performance  improvement  over  a  slower  context  switch  that  waits 
for  the  pipeline  to  drain  before  starting  the  next  thread. 

8.1.2  Result  Matching 

Multithreading  allows  issuing  multiple  memory  references  into  the  network  to  hide 
network  latency.  A  difficulty  arises  because  most  networks  do  not  preserve  message  order 
and  thus  the  responses  must  be  matched  with  the  requests. 

A simple solution is to send a small tag along with each message. This tag is later
used to identify the returning result message. The tag should contain the thread number
of the issuing thread and the register in which to put the result. This allows writing the
result directly into the register file through a second write port as shown in Figure 8.2.
By storing results directly into the register file, no special storage is needed and they are
immediately available upon resumption of the thread. Alternatively, the second write port
can be eliminated if the results are buffered and later written into the register file during
cycles in which the processor does not write to the register file.
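
The tag scheme can be modeled in software as below. This is a behavioral sketch only: the class, the dictionary message format, and the field names are invented for illustration and say nothing about the actual message layout or the hardware write port.

```python
class ResultMatcher:
    """Behavioral model of tag-based matching of responses to requests."""

    def __init__(self, num_threads, regs_per_thread):
        # One register set per thread; responses are written straight into it.
        self.regfile = [[0] * regs_per_thread for _ in range(num_threads)]

    def issue_load(self, thread, dest_reg, address):
        """Build a request message carrying the (thread, register) tag."""
        return {"tag": (thread, dest_reg), "address": address}

    def receive(self, response):
        """Write a returning value into the register named by its tag."""
        thread, dest_reg = response["tag"]
        self.regfile[thread][dest_reg] = response["value"]


# Two requests whose responses come back in the opposite order.
matcher = ResultMatcher(num_threads=4, regs_per_thread=32)
first = matcher.issue_load(thread=1, dest_reg=5, address=0x100)
second = matcher.issue_load(thread=2, dest_reg=7, address=0x200)
matcher.receive({"tag": second["tag"], "value": 222})
matcher.receive({"tag": first["tag"], "value": 111})
```

Because each response names its own destination, the order in which the network delivers responses does not matter, and no lookup table of outstanding requests is needed on the processor side.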

8.1.3  Scheduling 

The main task of the scheduler is to determine when a thread is ready. In the
multithreading models we have looked at, a thread becomes ready when all of its shared
memory accesses have returned from the network. To keep track of outstanding references
the scheduler will need a counter for each thread. The counter is incremented on each
shared load issued into the network and is decremented upon its return. When the counter
reaches zero, the thread becomes ready.
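
A software model of the per-thread counters might look like the following. It is a sketch under simplifying assumptions: the class and method names are hypothetical, ready threads are kept in a first-come-first-serve queue, and a thread is assumed to have switched out before any of its responses return.

```python
from collections import deque


class ReadyScheduler:
    """Tracks outstanding shared loads per thread and queues ready threads."""

    def __init__(self, num_threads):
        self.outstanding = [0] * num_threads  # one counter per thread
        self.ready = deque()                  # first-come-first-serve order

    def load_issued(self, thread):
        self.outstanding[thread] += 1

    def load_returned(self, thread):
        self.outstanding[thread] -= 1
        if self.outstanding[thread] == 0:
            # All of this thread's references are back: it is ready to run.
            self.ready.append(thread)

    def next_thread(self):
        """Return the next ready thread, or None if every thread is waiting."""
        return self.ready.popleft() if self.ready else None
```

For example, if thread 0 issues two loads and thread 1 issues one, and thread 1's response arrives first, then thread 1 is scheduled ahead of thread 0 even though thread 0 switched out earlier.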

The ready threads may be scheduled with any sort of scheduling policy. For
explicit-switch we used first-come-first-serve (which is the same as round robin when
accesses return in order). This is simple and fair. Other policies, such as those studied in
Section 5.2.1, might provide some additional benefit, for instance, by causing timeouts on
long run-lengths. However, since long run-lengths are uncommon under explicit-switch,
we expect the benefits of more complex policies will be small.

8.1.4  Multiple  Register  Sets 

The largest change to the processor design in terms of chip area is the addition of
multiple register sets. These multiple register sets are essential for fast context switching
because without them it would take at least 128 cycles to save the register set (see
footnote 2) for the current thread out to memory and then load in the register set for the
next thread. This overhead for context switching would overshadow any gains made from
hiding the memory latency.
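
The 128 cycle figure follows from simple counting, as the sketch below shows. It assumes the register set of footnote 2 (32 general purpose and 32 floating registers) and, as a best case, one register stored or loaded per cycle; the function name and the one-per-cycle rate are assumptions made for the illustration.

```python
GENERAL_REGS = 32  # from footnote 2
FLOAT_REGS = 32    # from footnote 2


def memory_switch_cycles(regs_per_set, cycles_per_access=1):
    """Cycles to store one register set to memory and load another."""
    save_cycles = regs_per_set * cycles_per_access
    load_cycles = regs_per_set * cycles_per_access
    return save_cycles + load_cycles


print(memory_switch_cycles(GENERAL_REGS + FLOAT_REGS))
```

Any slower memory path only raises this bound, which is why it is stated as "at least".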

In a typical RISC processor design, the register file only occupies a few percent
of the chip area. On the Stanford MIPS processor[PGH+84], for example, the register file
occupied 8.3% of the chip area. For more recent processors with large on chip caches,
the percentage of chip area used for the register file is even less. At this size, providing
10 register sets on chip, as was found to be sufficient to support the explicit-switch
multithreading model, is conceivable. However, multithreading designs that allow hundreds
of threads per processor, such as TERA, have so far been prohibited from considering single
chip implementation because of the size of their register files. Other multithreading designs,
such as Monsoon and *T[NPA92], do not provide a separate register set for each thread.

The precedent for increasing the register file size has already been set by the
SPARC[Fuj88] chip and the Am29000[Man92]. SPARC has 120 integer registers, and the
Am29000 has 192. Most of these are used to provide register windows to help speed up
procedure call and return by shifting to a new register set rather than saving and
restoring registers to memory. However, the benefit of these register windows is small because
compilers have been able to do a good job of avoiding most register saves and restores.
Some researchers have therefore proposed using the register windows for multiple contexts
instead[APRIL]. Unfortunately, as far as multithreading is concerned, the SPARC
architecture does not provide register windows for the floating registers.

2 Here we assume that the register set is 32 general purpose and 32 floating registers.

8.1.5  A  Denser  Register  File 

If the multithreading level and the number of register sets supported is small (say
4), the chip area used for the registers will be comparable to that used on the SPARC or
Am29000 chips. Supporting M = 4 is thus clearly reasonable. However, as the number of
threads and register sets is increased, the chip area will become more of a concern and we
therefore propose the following design which can be used for a denser implementation of
the register file for a multithreaded processor.

The  key  to  this  design  is  that  only  the  register  set  of  the  currently  active  thread  is 
used  by  the  processor.  The  other  register  sets  sit  idle  until  their  thread  is  scheduled  by  the 
processor.  Rather  than  keep  all  register  sets  in  large  multiported  register  cells,  the  inactive 
register  sets  can  be  kept  in  smaller  single  ported  register  sets  until  they  are  needed.  If  the 
single ported cells are implemented as dynamic memory, then on a VLSI chip they will
require less than one twelfth the area of a regular multiported static cell (see footnote 3).

The  main  obstacle  in  implementing  this  is  being  able  to  switch  to  a  new  active 
register  set  quickly  at  a  context  switch.  At  a  context  switch,  the  entire  contents  of  the 
active  register  set  must  be  saved  into  the  inactive  storage  area,  and  the  register  set  of  the 
next  thread  must  be  loaded.  For  a  register  file  of  32  registers,  each  of  which  is  32  bits, 
this  constitutes  1024  bits  that  must  be  saved  and  another  1024  bits  to  be  loaded.  If  done 
quickly,  i.e.  in  parallel,  this  can  be  done  in  two  cycles.  In  the  first  cycle  all  1024  bits  are 
transferred  out  of  the  active  register  file  into  the  inactive  register  file,  and  in  the  second 
cycle  the  new  1024  bits  are  transferred  in.  Moving  1024  bits  into  (or  out  of)  the  register 
file  in  a  single  cycle  would  require  1024  wires,  and  this  would  be  unwieldy. 

This  obstacle  can  be  overcome  by  interlacing  the  inactive  register  file  within  the 
active  register  file  as  shown  in  Figure  8.4.  This  shows  a  single  block  that  can  be  used  in  an 
array  of  32  by  32  blocks  to  implement  the  full  collection  of  active  and  inactive  register  sets. 

3 A static cell with 2 read and 1 write ports can be implemented in 64λ x 41λ. A dynamic cell in
10.5λ x 18λ[Waw91].

Figure 8.4: One bit of register file supporting 12 threads per processor.


The figure shows 12 single ported dynamic register cells for the inactive register sets and
2 multiported register cells for the active register sets. Usually just one of the two active
register sets is used, except that when the processor is in transition from one thread to
the next, instructions from both threads are in the pipeline and thus both register sets are
needed.

The multiported register cells are used in an alternating fashion as shown in
Figure 8.5. This example shows the transition from running one thread to the next and then
to a third (shown in white, grey, and black respectively). At the start, the white thread has
been executing out of the A-registers, and the B-registers have been loaded with the register
set for the grey thread. When the white thread context switches, the registers for the grey
thread are available and it can start executing immediately. The A-registers are retained
until all of the white thread's instructions have exited the pipeline (except the switch
instruction which does not use any registers). At this point, the A-registers are written into
the dynamic memory storage area used for the inactive threads. The A-registers are now
loaded with the register values of the black thread. By alternating between the A-registers
and the B-registers, the active registers can always be kept available.

This technique allows enough registers for twelve contexts to be implemented in
the space that would normally be needed for three. This compact register file keeps the bus
loading and lengths of the register read and write buses within acceptable bounds. As the
number of register sets is increased, this design becomes more desirable because additional
register sets can be added using only one twelfth the area that would be required for adding
a multiported static register set.
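
The area claims can be checked with a rough per-bit calculation from the cell dimensions quoted in footnote 3. The sketch below is only a back-of-the-envelope estimate: it treats the quoted dimensions as being in the same layout unit, counts cell area alone, and ignores wiring, decoders, and transfer circuitry.

```python
STATIC_CELL_AREA = 64 * 41       # multiported static cell, per bit (footnote 3)
DYNAMIC_CELL_AREA = 10.5 * 18    # single ported dynamic cell, per bit (footnote 3)


def dense_bit_area(num_threads, active_sets=2):
    """Per-bit area of the dense design: active static cells plus dynamic storage."""
    return active_sets * STATIC_CELL_AREA + num_threads * DYNAMIC_CELL_AREA


def conventional_bit_area(num_threads):
    """Per-bit area if every register set used a multiported static cell."""
    return num_threads * STATIC_CELL_AREA


print(STATIC_CELL_AREA / DYNAMIC_CELL_AREA)         # static-to-dynamic area ratio
print(dense_bit_area(12), conventional_bit_area(3))
```

Under these assumptions the static cell is more than twelve times the size of the dynamic cell, and two static cells plus twelve dynamic cells occupy slightly less area than three static cells, consistent with the statement that twelve contexts fit in the space normally needed for three.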

A potential complication arises because dynamic registers are often difficult to
use. In a dynamic register, the value is set by trapping a charge on a small capacitor. This
small capacitor has limited driving capacity, its charge is destroyed when it is read, and
it slowly leaks and therefore must be refreshed periodically. These characteristics are
all acceptable for use in our register file design. The limited driving capacity is acceptable
because only a small number of cells are on any wire and these wires are all short. Destructive
read is acceptable because register values are re-written when a thread completes its active
phase. And refreshing is not needed because the threads and their registers are constantly
being cycled through the processor.

A  minor  limitation  of  this  design  is  that  there  is  a  minimum  period  after  a  context 
switch  before  the  processor  can  context  switch  again.  This  minimum  period  is  4  cycles  for 
a  5  stage  pipeline  like  the  MIPS  R3000  and  is  shown  in  Figure  8.5.  The  grey  thread  (which 
uses  register  bank  B)  executes  for  the  minimum  period  of  4  cycles,  during  which  the  A 
register  bank  is  in  use  every  cycle  either  by  the  white  or  black  thread  or  for  saving  and 
restoring  registers.  This  4  cycle  minimum  on  the  context  switch  interval  should  not  pose  a 
problem  since  context  switches  rarely  occur  this  frequently. 

8.2  Hardware  for  Switch-On-Miss 

This  section  describes  the  additions  and  changes  to  our  multithreaded  processor 
that  are  needed  to  support  the  switch-on-miss  multithreading  model.  A  revised  processor 
datapath  is  shown  in  Figure  8.6.  The  main  change  is  the  addition  of  a  cache  for  shared 
data  and  support  for  cache  coherency.  Since  the  processors  are  multithreaded,  the  caches 
must be lockup-free so that they can continue operating while misses are being serviced in
the  network. 

8.2.1  Cache  Coherency 

Supporting cache coherency is complex[HLRW92], and some machines, such as
Cray's parallel vector computers and the TERA computer, choose to put their complexity [...]


Figure 8.6: Datapath with changes for switch-on-miss multithreading.

This table is the address table shown in Figure 8.6. [...]

Application         Efficiency                      Efficiency            Loss
(multithreading)    (switch cycles: on miss = 3,    (switch = 8 cycles)
                    on timeout = 0)

sieve (mt = 1)      89                              86                    3%
blkmat (mt = 3)     79                              77                    2%
sor (mt = 4)        88                              85                    3%
ugray (mt = 3)      88                              85                    3%
water (mt = 3)      91                              90                    1%
locus (mt = 2)      83                              78                    5%
mp3d (mt = 11)      86                              77                    11%
barnes (mt = 2)     82                              80                    2%

Table 8.2: Switch-On-Miss: Performance loss with 8 cycle context switch.

If, however, the context switch is caused by a timeout rather than a cache miss,
the context switch can be performed at the start of the pipeline rather than from deep
within it. This means that a context switch caused by a timeout can be fully pipelined and
thus performed without any wasted cycles.

Table 8.2 shows the performance loss if all context switches take 8 cycles instead
of either 3 or 0 as just explained. The experimental parameters are otherwise the same as in
the studies of switch-on-miss (refer to Table 5.6). Excluding mp3d, which does not cache
well, the performance loss due to the longer context switch is typically 2% or 3% with the
worst case being 5% for locus. In fact, the performance loss is somewhat overstated here
because of the timeout switches. Other scheduling policies that introduce fewer spurious
context switches would be able to mitigate the performance loss further.

These results show that a fast pipelined context switch provides only small
performance gains over a less aggressive implementation that drains the pipeline before
starting the next thread. This smaller performance impact of context switch time, compared
to that under explicit-switch, results from the longer run-lengths between context switches.

8.3  Conclusions  and  Extension  to  Multiprogramming 

In this chapter we have indicated that the hardware mechanisms needed to build
an explicit-switch or switch-on-miss multithreaded processor are reasonable. We have
presented them as modifications and additions to a simple RISC processor, and we have
suggested that multithreaded processors might be designed by modifying current
microprocessor designs. Some researchers have recently proposed that in the future even
uniprocessors should be multithreaded[CGL92, FP91, LGH92] because of increasing memory
latencies and the increased difficulty of scheduling deeper and wider pipelines.

Some  of  the  simplicity  of  the  multithreading  support  hardware  presented  in  this 
chapter comes from the fact that we have only considered the parallel machine to be running
a  single  application  at  a  time.  If  instead  we  had  tried  to  provide  a  more  general  parallel 
processor,  such  as  TERA[ACC+90],  where  each  processor  could  be  shared  by  threads  from 
several  different  programs,  there  would  have  been  additional  hardware  complexities.  TERA 
allows  up  to  128  threads  per  processor  and  16  simultaneously  executing  programs.  The 
large  number  of  threads  requires  a  very  large  register  file.  And  the  simultaneous  execution 
of  multiple  programs  requires  memory  protection  to  protect  all  the  threads  from  each  other. 

Multiprogramming  is  a  very  desirable  attribute  of  a  parallel  machine.  At  times  the 
machine  must  clearly  be  devoting  100%  of  its  resources  to  solving  a  single  large  problem, 
for  otherwise  such  a  large  machine  would  not  be  needed.  But  often  the  machine  must 
support  the  simultaneous  development  and  testing  of  applications  by  many  programmers. 
We  believe  the  approach  taken  by  the  CM-5  is  a  good  compromise.  The  CM-5  allows  the 
machine  to  be  partitioned  into  smaller  machines  for  independent  use,  and  it  also  allows  the 
entire  machine  to  be  time  sliced  between  applications.  This  time  slicing  can  be  done  at 
intervals  in  the  range  of  seconds,  to  limit  the  throughput  loss  due  to  stopping  the  machine, 
draining  the  network,  and  switching  to  a  new  process. 

Partitioning  the  machine  allows  a  real  time  user  to  collar  a  portion  of  the  machine, 
whereas  time  slicing  allows  multiple  application  developers  to  share  the  machine  while 
testing  their  applications  at  the  full  machine  size.  Under  these  policies,  the  individual 
processors  are  always  executing  just  a  single  program  at  a  time,  and  thus  the  processor 
model  presented  in  this  chapter  is  adequate  for  building  real  parallel  machines. 

An  additional  benefit  of  multithreading  over  current  parallel  machines  is  that  the 
number  of  processors  assigned  to  an  executing  program  can  be  easily  changed.  For  example, 
consider  a  program  that  is  being  run  at  a  multithreading  level  of  M  =  2  and  is  using  all 
1024  processors.  If  a  second  program  is  started,  the  first  program  can  be  compressed  down 
to use 512 processors at M = 4, or 256 processors at M = 8. Without multithreading it is
difficult to change the number of processors since somehow the running program must be
reconfigured to use fewer threads.
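
The compression in the example above keeps the total number of threads fixed while trading processors for multithreading level. The following is a minimal sketch of that arithmetic, assuming the thread count is simply the number of processors times M. The function name `compress` is hypothetical and does not come from this dissertation.

```python
def compress(processors: int, level: int, factor: int) -> tuple[int, int]:
    """Move a program onto fewer processors by raising its multithreading level.

    The total number of threads (processors * level) is held constant.
    """
    if factor <= 0 or processors % factor != 0:
        raise ValueError("factor must be positive and divide the processor count")
    return processors // factor, level * factor


# The example from the text: 1024 processors at M = 2.
print(compress(1024, 2, 2))  # (512, 4)
print(compress(1024, 2, 4))  # (256, 8)
```

Because the program already consists of processors * M threads, no thread has to be created or destroyed; only the assignment of threads to processors changes.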
Conclusions and Future Directions

because context switches occur frequently and often very close together. This means that a
large  number  of  threads  will  be  needed  to  hide  the  latency,  and  because  of  the  many  short 
run-lengths,  sometimes  the  latency  still  will  not  be  completely  hidden. 

Explicit-switch  improves  performance  by  introducing  an  explicit  context  switch 
instruction.  This  instruction  can  be  used  by  an  optimizing  compiler  to  group  together 
independent shared memory references. This grouping allows a single thread to issue
multiple references into the network before switching to another thread. The run-lengths are
increased,  and  the  distributions  are  significantly  improved  by  the  elimination  of  most  short 
run-lengths. 

Our  results  show  that  explicit-switch  is  able  to  tolerate  latencies  of  200  cycles  by 
using  a  multithreading  level  of  10  threads  (or  less)  per  processor,  and  that  this  is  sufficient 
to  allow  all  of  the  applications  studied  to  obtain  efficiencies  of  80%. 

However,  since  there  is  no  caching  of  shared  memory  in  these  systems,  all  shared 
references  are  sent  across  the  network,  and  thus  the  resulting  network  bandwidth  demands 
can  be  quite  high.  These  bandwidth  demands  vary  considerably  across  the  applications 
and  range  as  high  as  30  bits/operation.  We  expect  that  providing  such  high  bandwidth  will 
be  expensive,  and  thus  conclude  that  although  multithreaded  systems  without  shared  data 
caching  can  achieve  high  execution  efficiencies,  they  may  not  be  cost  effective  because  of 
their  high  bandwidth  requirements. 

Caching  was  effective  for  all  but  one  of  our  applications  (which  has  since  been 
rewritten  to  improve  its  caching  behavior).  For  the  rest  of  the  applications,  caching  was 
able  to  reduce  the  average  bandwidth  requirement  to  under  2  bits/operation.  This  large 
reduction  in  bandwidth  suggests  that  the  cost  and  complexity  of  maintaining  coherent 
caches  on  a  large  machine  will  be  justified  by  the  savings  afforded  from  use  of  a  skinnier 
network. 

Caching  is  also  beneficial  in  that  it  filters  out  many  of  the  remote  references  and 
thus  eliminates  many  potential  long  latency  operations.  Typical  miss  rates  ranged  from  1% 
to  4%.  With  so  many  fewer  remote  references,  the  impact  of  long  latency  is  diminished. 
Our  results  show  that  with  latencies  of  200  cycles,  execution  efficiencies  of  60%  to  70%  can 
be  achieved  without  multithreading,  and  that  with  a  multithreading  level  of  3  threads  per 
processor,  efficiencies  can  be  raised  to  80%  to  90%. 

Multithreading systems both with and without caching are able to achieve
efficiencies of 80%. The advantage of caching is that it reduces the multithreading level and
the amount of network bandwidth that is required.

While most of our experiments have assumed that there will be adequate network
bandwidth available, we also looked at the impact on performance of having limited
bandwidth, as will be the case for real networks. Our results for large (256 processor) systems
(using both multithreading and caching) show that network traffic will be bursty. If the
system is to have at most minor performance degradation, then the network will need to supply
a remote reference bandwidth of 2 to 4 bits/operation and a memory module bandwidth
of 8 to 16 bits/operation. The higher memory module bandwidth is necessary because of
random hot spot congestion.

9.2  Future  Directions 

We expect in the future that the latency problem will continue to increase.
Processors will continue to get faster, and because of both higher clock rates and superscalar issue,
they will issue remote references at higher rates. Ever larger parallel machines will also be
desired  and  these  will  have  larger  networks  and  longer  latencies.  These  faster  processors 
and  larger  networks  will  both  contribute  to  an  increased  need  for  latency  tolerance. 

Without  caching,  our  simulations  show  that  tolerating  a  200  cycle  latency  requires 
10  threads  per  processor.  We  expect  that  the  number  of  threads  needed  will  grow  at  least 
linearly  with  latency,  and  thus  at  significantly  larger  latencies,  the  number  of  threads  needed 
may  grow  prohibitively  large.  An  important  aspect  of  these  machines  that  needs  further 
research  is  the  amount  of  inter-block  grouping  that  can  be  obtained  by  a  smart  compiler. 
Our estimates in Section 4.3.2 suggest that research in this area should be successful.
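
To illustrate why the thread count should grow at least linearly with latency, the sketch below uses an idealized deterministic model. It is not the simulation methodology of this dissertation. It assumes every thread runs for a fixed `run_length` cycles and then waits `latency` cycles, with zero context switch cost. The run length of 25 cycles is a hypothetical value chosen only for illustration.

```python
def threads_to_hide(latency: int, run_length: int) -> int:
    """Threads needed so that some thread is always ready to run.

    Idealized model: each thread executes run_length cycles, then waits
    latency cycles for its reference to return; switching is free.
    """
    if run_length <= 0:
        raise ValueError("run_length must be positive")
    # ceil((run_length + latency) / run_length) using integer arithmetic
    return -(-(run_length + latency) // run_length)


for latency in (200, 400, 1000):
    print(latency, threads_to_hide(latency, run_length=25))
```

In this model the required thread count is 1 + latency / run_length, so it grows linearly with latency. Real run-length distributions with many short runs would require more threads than the model predicts, which is consistent with the "at least linearly" expectation stated above.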

With  caching,  our  simulations  in  Section  7.4  show  that  with  longer  latencies  the 
performance  of  a  single  threaded  processor  drops  to  30%  efficiency  at  a  latency  of  1,000 
cycles.  At  this  point  multithreading  is  very  beneficial  and  can  double  the  performance  to 
60%  efficiency,  while  still  using  just  a  moderate  number  of  threads  per  processor. 

In the introduction of this dissertation we listed six mechanisms for reducing the
impact of memory latency: caching, multithreading, weak consistency, prefetching, layout,
and aggregation. In this dissertation we have focused primarily on the first three techniques.
These are the most hardware oriented and the most broadly applicable, but there is
potential benefit from all of these mechanisms, and we suggest future research should look at
exploiting all of these techniques.
Appendix A

Distribution  Function  Histograms 


[...] histogram that represents data primarily by area rather than height. [...]
run-lengths were spread out evenly over the range from 2000 to 2100 [...] however, they
constitute a sizable amount and should appear as such. Since the points from 2000 to 2100
are all so close [...] a single pile constituting [...] from 2000 to 2100. Using the sand
analogy, [...] overlap, the sand should simply pile up higher [...] together.

Figure A.2 shows an example of this merging. There are four clusters of five piles each, and
each pile constitutes 5% of the data. The first cluster is the points {4,5,6,7,8}, and these
points are far enough apart on the logarithmic scale that the piles remain independent.
The second cluster of points is {11,12,13,14,15}. These piles are starting to merge, but
are still distinguishable. The third cluster of points is {21,22,23,24,25}, and these are close
enough together that they have almost merged into a single pile. Finally, the points of the
fourth cluster, {101,102,103,104,105}, are so close together that the merged pile looks identical
to a single pile of 25%. The important point is that each of the four clusters comprises a total
of 25% and uses the same amount of ink on the page. Where there is room to distinguish
the individual components, this is done, and when there is not enough room, the piles are
aggregated into a larger pile.

Figure A.3: Example histogram showing a uniform distribution over the
domain [1,100].

A.3  Mismatch  with  Old  Intuition 

Unfortunately, if we are familiar with the look of a distribution when plotted on
a linear axis, it will appear distorted when plotted in this new format. A strong example of
this is shown in Figure A.3. Here we show a uniform distribution where each point from 1
to 100 has 1% of the run-lengths. On a linear axis this would appear flat. But in our new
format, the tighter spacing at the high end does not allow enough room to show the many
small piles independently. These adjoining piles are aggregated together and the graph
rises.

Despite  this  mismatch  with  our  old  intuition,  in  most  cases  these  graphs  provide 
a  compact  and  clear  understanding  of  the  distributions  that  we  will  see  in  this  dissertation. 
Appendix B
Simulator


B.1 Introduction

Simulation is an essential tool in the process of computer design. While the speed
of simulation has always been a concern, it is of critical concern when simulating parallel
machines because of the increased computational power of these machines. The arithmetic is
obvious: simulating one second of execution of a one MIP uniprocessor requires simulating
one million instructions, but simulating one second of execution of a thousand processor
parallel machine requires simulating one billion instructions. Most simulation based research
projects report being limited in scope and accuracy by the speed of their simulators[BR92, GHG+91,
ON90]. Faster simulators allow larger and more realistic simulations to be performed and
help speed up the experimental process by allowing more rapid feedback of simulation
results.

Our simulation system, FAST (Fast Accurate Simulation Tool), has a simulation
slowdown ranging from 10 to 100. This slowdown factor is the average number of cycles it
takes to simulate a single cycle of execution for a single processor. It varies based on the
application program being simulated. Applications with more frequent references to shared
memory interact with the simulator more frequently and therefore take longer to simulate.
Comparable simulation systems such as that of O’Krafka[O’K89] or Tango[DGH91] have
reported slowdowns of 2,000 and 500-6,000 respectively.
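
The practical effect of these slowdown factors can be seen by working through the arithmetic. The sketch below applies the definition of slowdown given above (host cycles per simulated cycle per simulated processor) to one second of execution on a thousand processor machine. It assumes one instruction per simulated cycle, and the function name `host_cycles` is hypothetical.

```python
def host_cycles(sim_seconds: float, mips: float, processors: int, slowdown: float) -> float:
    """Host cycles needed to simulate the given amount of parallel execution.

    Assumes each simulated processor completes one instruction per cycle,
    so a 1 MIP processor executes one million cycles per second.
    """
    simulated_cycles_per_processor = sim_seconds * mips * 1_000_000
    return simulated_cycles_per_processor * processors * slowdown


# Slowdown factors quoted in the text.
for name, slowdown in [("FAST (low)", 10), ("FAST (high)", 100),
                       ("O'Krafka", 2000), ("Tango (high)", 6000)]:
    cycles = host_cycles(1.0, 1.0, 1000, slowdown)
    print(f"{name:14s} {cycles:.1e} host cycles")
```

Under these assumptions, a slowdown of 10 turns one billion simulated instructions into ten billion host cycles, while a slowdown of 6,000 turns the same workload into six trillion.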

FAST was developed for the purpose of studying large shared memory
multiprocessors with hundreds or thousands of processors, and to run real applications on these
simulated machines. To support our simulation studies of such large systems, we needed a
[...]

The technique of execution driven simulation[C+88] is the foundation of FAST.
We are not concerned [...] program is augmented with additional instructions which keep
track of [...] and return control to the simulator at [...] memory. The net result is that most
instructions are directly executed [...]

In this research we have extended the idea of execution driven simulation [...]
several new techniques that have allowed us to build a simulator that is [...]
more accurate than previous comparable simulators.

B.1.1  Overview 

The remainder of this appendix is broken into five sections. Section B.2 discusses
several previous simulators and their tradeoffs in [...] Section B.3 presents an overview of
FAST. Section B.4 [...] of execution driven simulation. Section B.5 reports performance
results. And Section B.6 summarizes and suggests directions for future research.

B.2  Previous  Simulators  and  Tradeoffs 

There have been an enormous number of simulation systems written for various
purposes. Here we focus on [...] the same purpose: simulating large shared memory
multiprocessors at the instruction level. [...] performance in terms of their slowdown [...]
memory [...] interleaved and simulated in an accurate [...]
B.2.1 Cycle-by-Cycle Simulators

The most straightforward type of simulator to build is one that cycles through the
parallel processors, simulating one instruction at a time from each of the processors. Two
examples are the simulator by O’Krafka[O’K89], which we are more familiar with since this
was done at Berkeley, and ASIM (refer to the description in [Del91]) developed at MIT as
part of the Alewife project. These simulators are slow because they are essentially assembly
language interpreters. The reported slowdown factor for O’Krafka’s simulator is 2,000, and
for ASIM it is reported as ranging from 200-5,000. Cycle-by-cycle simulators are accurate
in interleaving global events since they simulate the entire machine one cycle at a time,
but they may be inaccurate in instruction timing (as is O’Krafka’s simulator) because it is
complex and time consuming to accurately model the processor’s pipeline.

The  performance  of  these  cycle-by-cycle  simulators  is  dominated  by  instruction 
interpretation  since  this  is  done  for  every  single  cycle  of  the  executed  program.  Interesting 
events,  like  shared  memory  references,  occur  less  frequently. 

B.2.2  Execution  Driven  Simulators 

Execution driven simulation can be substantially faster than a cycle-by-cycle
simulator because it eliminates the instruction interpretation portion of the simulator. Instead,
control is handed over to the augmented program which executes for several cycles before
encountering an event of interest and returning control to the simulator. The simulated
processor has now advanced its private clock past those of other simulated processors.
Accurate event interleaving dictates that the event should not be processed immediately, but
rather it must be scheduled and executed once the entire global state has advanced to the
event’s time step. This means that instead of cycling between the simulated processors on
a cycle by cycle basis, it is sufficient to cycle between them at each event (as long as the
events are then queued and later executed at their proper times).
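
The interleaving rule described above can be sketched as follows. This is an illustrative model rather than the FAST implementation: each simulated processor is reduced to a list of run lengths (the cycles it executes directly before each event), events are processed with zero latency, and a binary heap stands in for whatever event queue a real simulator uses. The function name `interleave` is hypothetical.

```python
import heapq


def interleave(run_lengths):
    """Return (time, processor) pairs for all events in global time order.

    run_lengths[p] lists how many cycles processor p executes directly
    before each of its events (a stand-in for the augmented program).
    """
    pending = []  # min-heap of (event_time, processor, event_index)
    for proc, runs in enumerate(run_lengths):
        if runs:
            heapq.heappush(pending, (runs[0], proc, 0))

    order = []
    while pending:
        # Take the earliest event in the whole machine, not the most recent one.
        time, proc, idx = heapq.heappop(pending)
        order.append((time, proc))
        nxt = idx + 1
        if nxt < len(run_lengths[proc]):
            # Resume this processor. It runs ahead on its private clock until
            # its next event, which is queued rather than processed right away.
            heapq.heappush(pending, (time + run_lengths[proc][nxt], proc, nxt))
    return order


print(interleave([[5, 3], [2, 10], [4]]))
# [(2, 1), (4, 2), (5, 0), (8, 0), (12, 1)]
```

The simulator switches between processors only once per event, yet the events are still handled in the same order that a cycle-by-cycle simulator would produce.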

The Tango simulator[DGH91] developed at Stanford is an execution driven
simulator. It is based on Unix shared memory and uses Unix context switches in order to switch
from executing one processor to another. These heavy weight context switches however
require thousands of cycles, and thus they slow the simulator substantially if it switches at
every event in order to accurately interleave them. For accurate simulations they report
slowdown factors ranging from 500 to 6000. Because of this large cost of context switching,
they provide an option to trade off accuracy for faster execution by letting the individual
processor clocks get out of sync and not trying to accurately interleave the shared
memory references. They have recently rewritten their simulator to use a light weight thread
package, which should significantly reduce the magnitude of their context switch overhead
problem.

The  Proteus  simulator  developed  at  MIT[BDCW91,  Del91]  is  another  execution 
driven  simulator.  It  does  use  a  light  weight  thread  package,  and  is  substantially  faster 
than Tango. They report typical slowdown factors ranging from 35 to 100. However, they
have a substantial accuracy problem in their instruction timing because they do not apply
code augmentation at a consistent low level. They replace shared memory references in
the C source code with calls to the simulation routines (and optionally also insert statistics
gathering calls). They then compile this modified code and apply code augmentation for
timing  on  the  assembly  language.  Because  each  shared  reference  (which  should  be  just  a 
single  instruction)  is  replaced  with  a  procedure  call,  the  compiler  optimizations  that  can  be 
applied  and  the  object  code  produced  are  substantially  changed  from  that  which  would  have 
been  produced  if  the  original  code  were  compiled  directly.  In  fact,  their  good  performance 
is  partially  due  to  the  fact  that  their  insertion  of  procedure  calls  causes  the  compiler  to  save 
away  important  registers,  and  thus  allows  them  to  “exploit  ‘partial’  context  switches”  in 
which  they  only  save  a  limited  amount  of  the  register  file.  This  is  good  for  performance, 
but  bad  for  timing  accuracy. 

B.2.3  Tradeoffs 

We  have  identified  the  following  five  tradeoffs  in  simulator  design: 

Performance:  Execution  driven  simulation  is  the  most  important  factor  in  building  a 
fast  simulator  because  otherwise  the  interpretation  of  individual  instructions  is  the 
dominant  cost.  The  next  most  important  factor  is  fast  context  switching  between 
the  simulated  processors  because  frequent  context  switching  is  required  to  accurately 
order  global  events. 

Accuracy: Performing all code augmentation at the assembly language level is necessary
for accurate instruction timing. Any source code modifications that change the code
generated by the compiler affect the compiler’s optimization ability and thus the
accuracy of instruction level timing. Switching between simulated processors at all
global events is required in order to obtain a correct global ordering. If context
switching is expensive, then the simulator writer or user is tempted to trade accuracy
for performance by context switching less often.

Source Alteration: Ideally the source code should be compiled and optimized in its
original form as it would be written for a shared memory multiprocessor. However, all
of these simulators require some source changes. Proteus is the most egregious and
requires new operators be used for all shared memory references. O’Krafka’s
simulator and Tango both disallow static shared variables, and thus all such variables must
be allocated dynamically and referenced indirectly through pointers. FAST only
requires minor syntactic changes¹ that have no effect on the instructions generated by
the compiler.

Modularity: All of the simulators have similar modularity. Each allows selecting and
mixing different modules for different aspects of the machine, such as the cache and
the interconnection network. Normally this is done by linking the modules together,
but Tango also has the option (at substantial performance cost) of using distinct Unix
processes.

Portability: Portability is poor for all of these systems because they are tied to the
instruction set that they are designed for. Direct execution simulators must be run on that
specific  type  of  machine,  but  cycle-by-cycle  simulators,  since  they  are  interpreters,  can 
use  cross-compiled  applications  and  be  run  on  any  machine.  Porting  execution  driven 
simulators  to  a  new  machine  involves  changing  the  code  augmentation  to  understand 
the  new  machine’s  instruction  set.  The  actual  simulators  are  all  written  in  high  level 
languages  and  should  presumably  be  portable. 

Based on an understanding of these tradeoffs, we have built our FAST simulation
system so that it is faster, more accurate, and requires less intrusive source alterations. Its
modularity and portability are similar to those of the other simulators discussed.


B.3  Simulator 
Figure B.1 shows a diagram of using FAST. First the application program to be
simulated  is  compiled  with  full  optimization  just  as  it  would  be  for  a  real  parallel  processor, 
and  then  it  is  linked  with  any  libraries  that  it  uses,  such  as  math  routines. 

The linked object code module is then read into the code modifier which performs
the various code augmentations (which will be discussed in the next section). It is
important that augmentation be done on library functions since some applications use these
extensively. System calls are not handled, but these usually do not occur in the parallel
computation phases of the parallel applications that we have studied.

The  modified  code  is  then  linked  with  the  simulator  and  selected  modules  that 
simulate  the  caches,  network,  and  scheduler.  A  large  number  of  these  modules  have  been 
written,  and  they  can  be  selected  based  on  what  is  of  interest  to  the  user.  For  caching  there 
are  modules  for  various  cache  configurations  and  protocols,  or  for  no  caching  at  all.  For 
networks  the  simulator  is  usually  used  with  a  simple  constant  time  network  approximation, 
but it has also been used with a detailed simulator of packet switched networks. The
scheduler module is used for multithreading studies and implements simple scheduling policies
such  as  FIFO,  or  more  complex  policies  like  priority  scheduling  or  timeouts. 

The  single  executable  file  produced  includes  the  simulator,  the  various  modules, 
and  the  modified  application  code.  When  it  is  run,  the  simulator  starts  first.  It  reads  in  a 
simulation  parameter  file  that  specifies  the  number  of  processors,  level  of  multithreading, 
network  latency,  and  other  parameters.  It  then  calls  initialization  routines  for  the  various 
modules,  and  then  starts  up  and  manages  the  execution  driven  simulation  of  the  application 
program. 


The  core  of  the  simulator  is  a  simple  time  wheel  scheduler.  This  is  just  a  linear 
array  with  one  slot  per  time  step  (modulo  the  array  size),  where  each  slot  points  to  a  linked 
list  of  events  that  will  occur  at  that  time  step.  The  simulator  operates  by  removing  an 
event  at  the  current  time  step,  simulating  it  (using  execution  driven  simulation),  and  then 
placing  the  resulting  event  into  the  proper  slot  to  be  executed  in  the  future.  This  is  very 
efficient  since  there  is  no  polling  to  test  for  ready  events.  For  simulations  of  large  parallel 
machines,  there  are  so  many  events  that  typically  every  slot  has  one  or  more  events  in  it. 
The  average  cost  of  scheduling  an  event  is  thus  very  small. 
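
The time wheel described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FAST source: Python lists stand in for the linked lists, the class and method names (`TimeWheel`, `schedule`, `run`) are hypothetical, and an event may only be scheduled less than one wheel revolution into the future so that it cannot alias with a different time step that maps to the same slot.

```python
class TimeWheel:
    """Array of event lists indexed by time step modulo the wheel size."""

    def __init__(self, size: int):
        self.size = size
        self.slots = [[] for _ in range(size)]
        self.now = 0

    def schedule(self, time: int, event) -> None:
        # Events must fall within one revolution of the current time step.
        if not self.now <= time < self.now + self.size:
            raise ValueError("event time outside the wheel's window")
        self.slots[time % self.size].append(event)

    def run(self, handler, until: int) -> None:
        """Process events in time order; handler may return (delay, event)."""
        while self.now < until:
            slot = self.slots[self.now % self.size]
            while slot:
                event = slot.pop(0)
                follow_up = handler(self.now, event)
                if follow_up is not None:
                    delay, next_event = follow_up
                    self.schedule(self.now + delay, next_event)
            self.now += 1


log = []

def handler(time, proc):
    # Stand-in for execution driven simulation: record the event and say
    # that the same processor produces another event 3 cycles later.
    log.append((time, proc))
    return (3, proc) if time < 6 else None

wheel = TimeWheel(size=8)
wheel.schedule(0, "p0")
wheel.schedule(1, "p1")
wheel.run(handler, until=10)
print(log)
# [(0, 'p0'), (1, 'p1'), (3, 'p0'), (4, 'p1'), (6, 'p0'), (7, 'p1')]
```

Scheduling an event is a single array index and an append, with no searching or sorting, which is why the average cost per event stays small when nearly every slot is occupied.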
B.4 Code Augmentation

Code augmentation is the process of taking an original piece of code and adding
to it and/or modifying it so that it can perform additional functions. Traditionally it has
been used for the following three purposes:

Time Counting: Instructions are added to each basic block so that when that block
is executed, the extra instructions increment a time counter with an amount
corresponding to the number of cycles required for the processor to execute the original
basic block. This is the basic code augmentation that is used in all execution driven
simulators.

Statistics Gathering: Instructions are added to gather statistics such as counts of the
number of times that certain pieces of code are executed. This is the basis of execution
driven profilers, such as the MIPS pixie program[MIP86].

Event  Call-Outs:  At  special  events,  such  as  shared  memory  references,  code  is  inserted 
to  call  out  to  the  simulator  in  order  to  let  the  simulator  regain  control  and  process 
the  event.  This  is  used  in  a  simplified  form  when  debuggers  create  breakpoints  by 
replacing  the  instruction  at  the  breakpoint  with  a  trap  instruction. 

In  this  section  we  extend  the  idea  of  augmentation  with  several  new  uses: 

In-line Context Switching: The augmented code typically runs for just a small number
of cycles before reaching an event and returning control to the simulator. During this
execution  only  a  small  subset  of  the  register  file  is  ever  accessed,  and  therefore  it  is 
wasteful  to  actually  load  and  store  the  entire  register  set.  We  use  code  augmentation 
to  load  and  store  register  values  at  basic  block  boundaries  so  that  only  the  used  and 
modified  registers  are  loaded  and  stored. 

Reference Indirection: For a single threaded program, which the compiler thinks it is
compiling, static local variables are assigned to fixed memory addresses. However,
for a parallel program, each thread needs its own copy of these variables. Our code
augmenter converts these references into indirect references into the executing thread’s
context block which contains the thread’s local state: register values, local variables,
and stack.
Dynamic Reference Discrimination: We suggested in Section 2.1.3 that a compiler,
with proper language support, should be able to statically identify all memory accesses
as going to either local or shared memory. Since we do not have languages and
compilers that support this, we have added code augmentation to check address ranges
at execution time and determine if a pointer is to shared or local memory. Optionally,
this reference classification information can be collected as a trace file on the first
run of an application and then fed back into the code modifier to do complete static
classification.²

Re-Optimization: During our studies of multithreading we found it important to group
shared memory load instructions together. We implemented this within the code
modifier by reordering instructions and percolating shared memory load instructions up
towards the tops of basic blocks.

Extended Instruction Sets: For the most part we accepted the instruction set of the
processor on which simulations were being executed: the MIPS R3000[Kan89]. However,
we did want to add a number of new instructions such as: double word loads and stores,
local and shared memory versions of all loads and stores, an explicit thread switch
instruction, fetch-and-add, and other special synchronization instructions. These were
all added by having the code modifier convert these into calls to special simulator
routines.

Virtual Registers: One of the most useful new code augmentations is virtualization of the
register file. This simplified implementation of the other code augmentations because
it eliminated concerns about remapping registers. This will be discussed more fully at
the end of this section.
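
As an illustration of dynamic reference discrimination from the list above, the sketch below classifies references by address range and then summarizes a trace per instruction site, as a feedback run might. The address bounds, the trace format, and the "mixed" category for sites seen touching both regions are all hypothetical assumptions made for this sketch and do not come from FAST.

```python
SHARED_BASE = 0x4000_0000   # hypothetical start of the shared region
SHARED_LIMIT = 0x8000_0000  # hypothetical end of the shared region (exclusive)


def is_shared(address: int) -> bool:
    """Run-time range check that the augmented code would perform."""
    return SHARED_BASE <= address < SHARED_LIMIT


def classify(trace):
    """Map each instruction site to 'shared', 'local', or 'mixed'.

    trace is an iterable of (site, address) pairs recorded on a first run.
    """
    result = {}
    for site, address in trace:
        kind = "shared" if is_shared(address) else "local"
        # setdefault returns the earlier classification if the site was seen.
        if result.setdefault(site, kind) != kind:
            result[site] = "mixed"
    return result


trace = [("pc1", 0x4000_0010), ("pc2", 0x10), ("pc1", 0x4000_0020),
         ("pc3", 0x10), ("pc3", 0x5000_0000)]
print(classify(trace))
# {'pc1': 'shared', 'pc2': 'local', 'pc3': 'mixed'}
```

Only sites that always reference the same kind of memory could be classified statically from such a trace; a site classified as mixed would still need the run-time check.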

B.4.1  An  Example 

Figure  B.2  shows  an  example  of  code  augmentation  for  a  small  code  fragment 
which  will  be  used  to  demonstrate  several  of  the  code  augmentations  described  above. 
The  original  assembly  language  instructions  are  shown  in  part  (a);  the  modified  code  is 
shown in part (b).3 These instructions were generated by the compilation of the expression

2 This accurate static classification is required for our re-optimization of the code.

3 The instruction set is approximately that of the MIPS R3000[Kan89], but it has been simplified slightly
to make the example clearer.

code for:   A = B + C + X
where:      A     is variable in shared memory
            B, C  are variables in local memory
            X     is variable in register r8

registers:  Rgp   - global pointer
            Rsbp  - shared base pointer
            Rcp   - context pointer
            Rtime - time value

simulator interface:
            simulator_sw(r4 = address, r5 = value)

(a) original code

    lw    r1, local_addr_of_B(Rgp)
    lw    r2, local_addr_of_C(Rgp)
    add   r3, r1, r8
    add   r3, r3, r2
    ----------------------------------------
    sw    r3, shared_addr_of_A(Rgp)

(b) modified code

    lw    r8, offset_of_r8(Rcp)             load used registers
    lw    r1, local_addr_of_B(Rcp)
    lw    r2, local_addr_of_C(Rcp)
    add   r3, r1, r8
    add   r3, r3, r2
    sw    r1, offset_of_r1(Rcp)             save modified registers
    sw    r2, offset_of_r2(Rcp)
    sw    r3, offset_of_r3(Rcp)
    addi  Rtime, Rtime, 4                   accumulate time
    ----------------------------------------
    addi  r4, Rsbp, shared_addr_of_A
    lw    r5, offset_of_r3(Rcp)
    addi  Rtime, Rtime, 1                   accumulate time
    call  simulator_sw                      call out to simulator

Figure B.2: Example of code augmentation

Figure  B.2:  Example  of  code  augmentation 


A  =  B  +  C  +  X,  where  the  variables  B  and  C  will  be  loaded  from  local  memory,  the 
variable  X  is  already  in  register  r8,  and  the  result  A  will  be  stored  in  shared  memory. 
Assume  for  this  example  that  this  expression  by  itself  forms  a  basic  block.  Basic  blocks  are 
the  granularity  at  which  we  perform  analysis  and  code  augmentation,  and  thus  this  small 
basic  block  can  serve  as  a  complete  example. 

The  first  step  is  to  identify  which  instructions  can  be  directly  executed  by  the  host 
processor  and  which  instructions  will  require  call-outs  to  the  simulator.  In  this  example  the 
last  instruction  references  shared  memory  and  thus  will  be  replaced  with  a  call-out.  The 
other  four  instructions  are  local  to  the  processor  and  can  be  directly  executed.  For  ease  of 
manipulation,  the  call-out  instruction  is  isolated  into  its  own  basic  block,  as  indicated  by 
the  horizontal  lines  separating  the  instructions. 
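As a rough illustration of this first step, the sketch below splits a basic block so that
every shared memory instruction ends up alone in its own block. The Instr record and its
shared flag are hypothetical stand-ins for whatever representation the code modifier
actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str               # mnemonic, e.g. "lw", "add", "sw"
    operands: tuple       # operands as strings
    shared: bool = False  # True if the instruction references shared memory


def split_at_callouts(block):
    """Split a basic block so each shared reference is isolated in its own block."""
    pieces, current = [], []
    for instr in block:
        if instr.shared:
            if current:
                pieces.append(current)
                current = []
            pieces.append([instr])  # the future call-out stands alone
        else:
            current.append(instr)
    if current:
        pieces.append(current)
    return pieces


# The original code of the example: four local instructions, one shared store.
example = [
    Instr("lw", ("r1", "local_addr_of_B(Rgp)")),
    Instr("lw", ("r2", "local_addr_of_C(Rgp)")),
    Instr("add", ("r3", "r1", "r8")),
    Instr("add", ("r3", "r3", "r2")),
    Instr("sw", ("r3", "shared_addr_of_A(Rgp)"), shared=True),
]
pieces = split_at_callouts(example)
```

For the example this yields one block of four directly executable instructions followed
by a single-instruction block holding the shared store.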

The second step is to calculate the timing of the basic blocks. The first block has
four instructions and takes four cycles. The second block has one instruction and takes one
cycle.4 The timing of each basic block is computed statically and is used in the inserted
instructions which accumulate the running execution time in register Rtime.

4 In general determining accurate timing is somewhat more complicated because the processor pipeline
must be modeled. Usually looking just within a basic block is adequate, but sometimes long latency floating
point operations continue executing past the end of a basic block and affect the timing of subsequent blocks.

The third step is reference indirection. The loads of local variables B and C are
originally relative to the global pointer (register Rgp). These are changed to be thread
relative by indexing off of the thread context pointer (register Rcp).5

Step four involves adding code for in-line context switching. In our implementation,
we maintain the invariant condition that between basic blocks all register values
should be correctly stored in the context block of the executing thread. In our system this
context block is pointed to by the Rcp register, and thus register loads and stores are relative
to this pointer.

At the start of each basic block we insert code to load the registers whose values
will be used. In the example, only the value in register r8 is used. The registers r1, r2
and r3 also appear, but they do not need to be loaded since their original values are not
used. At the end of each basic block we append code to store any registers whose values
have been redefined. In the example these are r1, r2 and r3.
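The choice of which registers to load and store can be sketched as a single pass over
the block: a register must be loaded if it is read before being defined in the block, and
stored if it is defined anywhere in the block. The (dest, sources) tuple representation
below is an assumption made for the sketch, and reserved registers such as the context
pointer are left out.

```python
def context_switch_sets(block):
    """Return (registers to load at block start, registers to store at block end).

    block is a list of (dest, sources) pairs, one per instruction, where dest
    is the register written (or None) and sources are the registers read.
    """
    defined = []   # registers redefined in the block, in order of first definition
    to_load = []   # registers whose incoming value is read inside the block
    for dest, sources in block:
        for reg in sources:
            if reg not in defined and reg not in to_load:
                to_load.append(reg)
        if dest is not None and dest not in defined:
            defined.append(dest)
    return to_load, defined


# First basic block of the example.
first_block = [
    ("r1", ()),            # load B into r1
    ("r2", ()),            # load C into r2
    ("r3", ("r1", "r8")),  # add r3, r1, r8
    ("r3", ("r3", "r2")),  # add r3, r3, r2
]
loads, stores = context_switch_sets(first_block)
```

For the example block this gives r8 as the only register to load and r1, r2 and r3 as
the registers to store, matching the description above.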

This completes the code augmentation for the first basic block. The second basic
block is the store word instruction (sw) that originally stored the value in register r3 to an
address in shared memory. It is replaced by a sequence of instructions which load parameters
and then call out to a simulation routine to perform the shared memory operation. The
address and data values are loaded into the argument registers (r4 and r5), and the time
counter (Rtime) is incremented by 1 (the time taken by the original instruction). If the
simulator finds that more time would be needed by this instruction, for instance if the
memory network is clogged or there is a cache miss, the simulator would add the additional
time.


This  completes  the  code  augmentation.  The  code  has  now  been  converted  so  that 
it  is  context  block  relative.  The  simulator  can  now  switch  threads  by  changing  the  context 
pointer  and  time  counter  and  then  jumping  into  the  new  thread  to  be  executed. 
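To make the effect of the augmentation concrete, the following sketch renders the
modified code of the example in Python. It is a toy model under several assumptions: the
context block is a dictionary, the local variables B and C are stored in it under their
names, the shared address of A is represented by the string "A", and simulator_sw is a
stand-in for the real simulator routine.

```python
def simulator_sw(ctx, shared_memory, address, value, extra_latency=0):
    """Stand-in for the call-out: perform the store and charge any extra time."""
    shared_memory[address] = value
    ctx["Rtime"] += extra_latency  # e.g. network contention or a cache miss


def run_modified_code(ctx, shared_memory):
    """Execute the augmented example against the context block ctx."""
    # First basic block: load used registers, compute, save modified registers.
    r8 = ctx["r8"]
    r1 = ctx["B"]          # local variable B, now context relative
    r2 = ctx["C"]          # local variable C, now context relative
    r3 = r1 + r8
    r3 = r3 + r2
    ctx["r1"], ctx["r2"], ctx["r3"] = r1, r2, r3
    ctx["Rtime"] += 4      # four instructions, four cycles
    # Second basic block: set up the arguments and call out to the simulator.
    r4 = "A"               # stands for the shared base pointer plus the offset of A
    r5 = ctx["r3"]
    ctx["Rtime"] += 1      # time of the original store instruction
    simulator_sw(ctx, shared_memory, r4, r5)


shared = {}
thread0 = {"r8": 3, "B": 1, "C": 2, "Rtime": 0}
thread1 = {"r8": 30, "B": 10, "C": 20, "Rtime": 7}
for ctx in (thread0, thread1):  # switching threads only means switching ctx
    run_modified_code(ctx, shared)
```

Because every register value lives in the context block between basic blocks, the same
code serves both threads, and each thread accumulates its own time.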


If  these  subsequent  blocks  are  selected  by  conditional  branches,  the  exact  timing  will  depend  upon  the 
branch  paths  taken  at  execution  time.  These  cases  are  rare,  and  for  our  simulator  we  use  timings  based  on 
the  statically  predicted  most  likely  branch  paths. 

5 Here reference indirection is simply changing from Rgp to Rcp and possibly changing the offset. It is
more involved when the original reference is not relative to Rgp.

B.4.2  Virtual  Registers

The  technique  of  in-line  context  switching  usually  leaves  most  register  values  in 
the  context  block,  and  this  motivated  the  idea  of  virtualizing  the  register  file.  When  register 
r8  was  loaded  and  later  used  in  Figure  B.2(b),  it  could  have  been  loaded  into  any  physical 
register  as  long  as  the  register  later  used  in  the  add  instruction  was  also  changed  to  the 
same  register.  Thus  the  virtual  registers  used  in  the  original  code  need  not  be  the  same  as 
the  physical  registers  used  in  an  expanded  basic  block.  Different  basic  blocks  could  choose 
to  use  different  physical  registers  to  hold  the  virtual  register  r8. 

The  benefit  of  this  is  that  we  can  now  have  more  virtual  registers  than  there  are 
physical registers. For instance we have used virtual registers Rtime, Rcp, and Rsbp in
our  modified  code.  The  mapping  between  virtual  and  physical  registers  is  possible  as  long 
as  each  individual  basic  block  does  not  use  more  virtual  registers  than  there  are  physical 
registers  to  map  into.  Mapping  problems  are  rare  and  occur  only  for  large  basic  blocks,  and 
they  are  easily  handled  by  splitting  these  large  blocks  into  multiple  smaller  blocks. 
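A per-block mapping of this kind can be sketched as follows. The function name, the
first-come assignment policy, and the physical register names are assumptions for
illustration only; the text does not describe how the mapping is actually chosen.

```python
def map_virtual_registers(block_registers, physical_registers):
    """Assign each virtual register used in one basic block to a physical register.

    Raises ValueError when the block needs more registers than are available,
    in which case the block would first have to be split into smaller blocks.
    """
    mapping = {}
    free = list(physical_registers)
    for virt in block_registers:
        if virt in mapping:
            continue  # already mapped earlier in this block
        if not free:
            raise ValueError("block uses too many virtual registers; split it")
        mapping[virt] = free.pop(0)
    return mapping


# Virtual registers appearing in the first block of the example.
used = ["r8", "r1", "r2", "r3", "Rtime", "r1"]
mapping = map_virtual_registers(used, ["p0", "p1", "p2", "p3", "p4", "p5"])
```

Since the mapping is computed independently for each basic block, another block is free
to place the same virtual register in a different physical register.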

This virtualization of the register file actually simplifies other code augmentations.
For instance in the old style of code augmentation, some specific physical register, say r30,
is used for time counting. Thus wherever r30 is used in the original code, the code must
be modified to work around the usurpation of this register. With virtual registers, the
time counter is just another virtual register (Rtime), and no such workaround is needed.

Virtual registers have many potential uses. One example use was in a research
project  that  tried  to  improve  memory  reference  patterns  by  re-optimizing  basic  blocks  in 
order  to  group  together  shared  memory  load  instructions.  This  re-optimization  needed  a 
few  extra  temporary  registers  to  allow  reordering  of  instructions  while  still  preserving  all 
data  dependencies,  and  these  extra  registers  were  made  available  as  extra  virtual  registers. 
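The grouping of shared loads can be illustrated with a conservative sketch that moves
each shared load upward only while no data dependence is violated. It does not rename
registers; the extra temporary registers mentioned above would allow more moves by
removing write-after-read and write-after-write conflicts. The Instr tuple and the rule
of never crossing a store are assumptions of the sketch.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    sources: Tuple[str, ...]
    shared_load: bool = False


def can_hoist(load, above):
    """True if load may move above the instruction `above` safely."""
    if above.shared_load or above.op == "sw":
        return False  # keep the order of shared loads; never cross a store
    if above.dest is not None and above.dest in load.sources:
        return False  # read-after-write dependence
    if load.dest in above.sources or load.dest == above.dest:
        return False  # write-after-read or write-after-write dependence
    return True


def percolate_shared_loads(block):
    """Return a copy of the block with shared loads moved toward the top."""
    result = list(block)
    for i in range(len(result)):
        j = i
        while j > 0 and result[j].shared_load and can_hoist(result[j], result[j - 1]):
            result[j - 1], result[j] = result[j], result[j - 1]
            j -= 1
    return result


block = [
    Instr("add", "r1", ("r8", "r9")),
    Instr("lw", "r2", ("r10",), shared_load=True),
    Instr("add", "r3", ("r1", "r2")),
    Instr("lw", "r4", ("r11",), shared_load=True),
]
reordered = percolate_shared_loads(block)
```

In this small block both shared loads end up at the top, ahead of the two additions.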


B.5  Performance 

In  this  section  we  discuss  three  aspects  of  the  performance  of  our  simulator:  the 
cost of in-line context switching, the slowdown factors of basic simulations, and the effects
on  slowdown  when  simulating  multithreading  or  caching. 

B.5.1  Cost  of  In-line  Context  Switching

                                          Context switch cost     Average interval   Amortized cost
Application  Description                  switch in   switch out  between switches   per instruction
sieve        finds primes                     9.8         7.9            7.0               2.5
blkmat       blocked matrix multiply         47.7        50.3           48.0               2.0
sor          solves Laplace's equation        8.5         5.5            4.2               3.3
ugray        ray tracing renderer            11.8         9.1           10.1               2.1
water        system of water molecules       27.7        22.2           33.1               1.5
locus        standard cell router             8.0         5.2            4.0               3.3
mp3d         rarefied hypersonic flow         8.1         6.3            4.7               3.1

Table B.1: Context Switch Costs


Table B.1 shows the effectiveness of in-line context switching. It gives the context
switch  frequency  and  the  average  context  switch  costs  for  the  applications  that  we  have 
used  in  our  simulation  studies. 

The  switch  in  cost  listed  in  the  table  is  the  average  number  of  registers  loaded  per 
context  switch  into  the  application  from  the  simulator.  The  switch  out  cost  is  the  average 
number  of  registers  saved  per  context  switch  from  the  application  out  to  the  simulator. 
Recall  that  these  register  loads  and  stores  do  not  all  occur  at  the  points  of  context  switching 
between  the  simulator  and  threads,  but  are  spread  among  the  prefixes  and  suffixes  of  the 
sequence  of  basic  blocks  executed  between  context  switches.  Also  included  in  these  context 
switch  costs  are  the  overheads  incurred  by  the  simulator  in  saving  and  restoring  reserved 
registers  such  as  the  program  counter,  time  counter,  stack  pointer  and  context  pointer. 

The  column  labeled  average  interval  between  switches  shows  the  average  number  of 
simulated  cycles  between  context  switches.  For  those  applications  that  context  switch  most 
frequently,  the  context  switch  cost  is  less  than  10  cycles.  The  locus  program,  for  example, 
accesses  shared  memory  very  frequently  and  thus  context  switches  at  an  average  rate  of 
once  every  four  cycles.  The  average  cost  of  these  context  switches  is  8.0  cycles  to  switch  in 
and 5.2 cycles to switch out. In all cases, the context switch cost is less than the size of the
register set.6 In comparison, the light-weight thread package used in Proteus[Del91] loads
and stores the entire register set and takes 135 cycles per context switch.

6 On a MIPS processor there are 29 integer, 32 floating point and 3 special purpose registers in the usable
register set.

Figure B.3: Simulation slowdown

In our system, the cost of a context switch is roughly inversely proportional to the
frequency of occurrence. The longer an application executes between switches, the more
registers it is likely to use. The blkmat and water applications, for example, context
switch less frequently than the other applications and their average context switch costs
are higher. However, since they do not context switch as frequently, the higher context
switch costs are amortized over a longer period. Overall, the total context switch overhead
ranges from 2 to 3 cycles per simulated cycle.
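The amortized figures can be checked with a short calculation. The formula used here,
the sum of the switch in and switch out costs divided by the average interval between
switches, is inferred from the published numbers rather than stated in the text; the
data are the context switch measurements reported for each application.

```python
def amortized_cost(switch_in, switch_out, interval):
    """Context switch overhead per simulated cycle (formula inferred, see above)."""
    return (switch_in + switch_out) / interval


# (switch in, switch out, average interval, published amortized cost)
context_switch_data = {
    "sieve":  (9.8, 7.9, 7.0, 2.5),
    "blkmat": (47.7, 50.3, 48.0, 2.0),
    "sor":    (8.5, 5.5, 4.2, 3.3),
    "ugray":  (11.8, 9.1, 10.1, 2.1),
    "water":  (27.7, 22.2, 33.1, 1.5),
    "locus":  (8.0, 5.2, 4.0, 3.3),
    "mp3d":   (8.1, 6.3, 4.7, 3.1),
}

computed = {
    name: amortized_cost(sw_in, sw_out, interval)
    for name, (sw_in, sw_out, interval, _published) in context_switch_data.items()
}
```

Each computed value agrees with the published one to within rounding, which is why
blkmat and water, despite their large per-switch costs, have the lowest amortized
overhead.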


B.5.2  Slowdown  Factors  for  Basic  Simulations

Figure  B.3  shows  the  performance  of  the  FAST  simulator  on  the  various  benchmark 
applications.  Results  are  shown  with  the  number  of  processors  varied  from  1  to  1024.  The 
slowdown  factors  shown  in  this  graph  are  the  number  of  cycles  taken  to  simulate  a  single 
cycle  of  a  single  thread.  Since  most  instructions  are  directly  executed  and  the  context 
switching  cost  has  been  reduced  to  just  2  to  3  cycles  per  simulated  cycle,  one  might  expect 
slowdown  factors  of  3  or  4.  The  slowdowns  are  larger  because  of  the  remaining  overhead 
which  comes  from  the  scheduling  mechanism  within  the  simulator,  the  simulation  of  shared 
references,  the  memory  simulator,  and  statistics  gathering.  For  this  graph  the  memory 
model  is  a  simple  ideal  memory  that  has  0  latency  and  no  contention. 

Two interesting trends can be observed from this graph. First, the slowdowns vary
for different programs. Programs such as blkmat and water have typical slowdowns from
10 to 30, while programs such as locus and sor have typical slowdowns from 60 to 100.
The difference comes from the different frequencies at which the applications interact with
the simulator. Sor and locus have context switches every 4 cycles, compared to blkmat and
water, which have context switches only every 30 to 50 cycles and thus require much less
scheduling by the simulator. The cost of simulated events is amortized over a larger number
of instructions, and thus the overall slowdown factors for blkmat and water are lower than
those for the other applications.

Figure B.4: Simulation slowdowns under different configurations.

The  second  interesting  trend  is  that  as  the  number  of  processors  is  increased,  the 
slowdown  factor  initially  drops  and  then  slowly  rises.  The  initial  decrease  in  slowdown  is 
due  to  the  time  wheel  algorithm  used  to  schedule  threads  and  events.  It  works  best  when 
there  are  many  processors  and  thus  there  are  many  events  per  cycle.  The  later  increase  in 
the  slowdown  factor  occurs  because  the  applications  use  more  synchronization  operations  as 
the  number  of  processors  is  increased.  Synchronization  operations,  especially  spinning  on 
locks  or  barriers,  involve  many  shared  accesses  and  thus  increase  the  work  of  the  simulator. 
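The text does not describe the time wheel in detail, so the following is only a generic
sketch of the data structure, not the simulator's implementation: events are kept in one
bucket per cycle, indexed by time modulo the wheel size. The class and method names are
hypothetical.

```python
class TimeWheel:
    """Generic timing wheel: one bucket per cycle, indexed by time modulo size."""

    def __init__(self, size):
        self.size = size
        self.buckets = [[] for _ in range(size)]
        self.now = 0
        self.pending = 0

    def schedule(self, time, event):
        """Queue an event; its time must lie within one wheel revolution of now."""
        if not (self.now <= time < self.now + self.size):
            raise ValueError("event time outside the wheel's horizon")
        self.buckets[time % self.size].append(event)
        self.pending += 1

    def pop_next(self):
        """Advance to the next non-empty cycle and return (time, events)."""
        if self.pending == 0:
            return None
        while not self.buckets[self.now % self.size]:
            self.now += 1  # every empty cycle costs one probe
        index = self.now % self.size
        events, self.buckets[index] = self.buckets[index], []
        self.pending -= len(events)
        return self.now, events
```

In such a structure every empty cycle still costs a probe, so the per-event overhead
falls when many events share a cycle, which is consistent with the observation that the
scheduler works best with many processors.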


B.5.3  Multithreading  and  Caching 

FAST  was  designed  in  a  modular  fashion  and  can  be  configured  to  perform  a 
wide  variety  of  different  simulations  depending  upon  what  is  of  interest  to  the  researcher 
conducting  the  simulation  studies.  The  main  uses  of  the  simulator  have  been  for  studies  of 
multithreading  under  long  memory  latencies  and  for  performance  studies  of  cache  coherency 
protocols. 

Figure  B.4  shows  the  performance  of  the  simulator  under  three  configurations: 
the ideal case which has 0 latency, the multithreading case which has 200 cycle latency and
several threads per processor, and the caching case which uses a cache simulator of the
Censier  and  Feautrier[CF78]  directory  based  cache  coherence  protocol.  The  ideal  case  and 
the  multithreading  case  have  roughly  the  same  performance.  This  occurs  because  studying 
multithreading  was  one  of  the  primary  intended  uses  of  FAST,  and  thus  multithreading 
support  was  built  in  from  the  start.  Single  threaded  execution  is  simply  a  special  case 
of  multithreading  in  which  there  is  just  one  thread  per  processor.  The  cache  simulator 
typically  takes  hundreds  of  cycles  per  reference  to  check  and  manipulate  the  caches’  states, 
and  this  extra  overhead  slows  the  simulations.  The  change  in  performance  is  moderated 
by  the  fact  that  the  cache  simulation  cost  is  amortized  over  the  total  number  of  simulated 
cycles. 


B.6  Summary  and  Future  Research 

We have used FAST to perform a large number of architectural simulations. Its
speed has allowed us to simulate larger problems and larger machines than would have
been possible with previous comparable simulators. Execution-driven simulation is the most
important technique for obtaining high performance.

However, speed is just one important aspect of FAST. By carefully understanding
the  tradeoffs  in  design  choices,  we  have  been  able  to  build  a  simulator  that  is  also  more 
accurate  than  previous  instruction  level  simulators.  The  most  important  point  is  that  code 
augmentation  must  be  applied  at  a  low  level  since  source  code  alterations  can  perturb  the 
object  code  produced  and  thus  the  accuracy  of  instruction  level  timings.  A  second  point 
is  that  accurate  interleaving  of  global  events  requires  frequent  context  switching  between 
simulated  processors,  and  thus  fast  context  switching  is  desirable. 

In  building  FAST,  we  have  extended  the  idea  of  code  augmentation  into  a  number 
of  new  areas  such  as  in-line  context  switching,  re-optimization,  extended  instruction  sets, 
and  virtualization  of  the  register  file.  These  extensions  have  been  important  in  making 
the  right  design  tradeoffs  so  as  to  obtain  both  high  performance  and  high  accuracy,  and  in 
making  a  simulator  that  is  flexible  enough  to  be  used  for  a  large  variety  of  experiments. 

There  are  several  possible  directions  for  future  research  with  FAST  or  similar 
simulators.  First,  since  we  are  simulating  a  shared  memory  multiprocessor,  it  should  be 
possible to speed up the simulator by executing it in parallel on today's small shared memory
multiprocessors in order to simulate tomorrow's larger machines. The main problem that
will arise is synchronizing and correctly interleaving the concurrent simulations of multiple
processors. 

Second,  FAST  would  be  a  good  foundation  for  a  parallel  program  development 
and  debugging  system.  Simulators  are  useful  for  debugging  because  they  can  reproduce 
identical  timing  races  on  subsequent  runs.  The  Proteus[Del91]  simulator  provides  a  powerful 
monitoring facility by inserting monitoring code into the source code of applications, and we
would like to see if similar mechanisms could be built without modifying the applications'
source code.

Third, our new augmentation techniques of virtualizing the register file and
extending the instruction set could be used along with a modified compiler to study various
architectural changes such as larger register files or new instructions.


